Abstract

The  residue  number  system,  or  RNS,  is  analyzed  in  detail  and  compared  against  a 
conventional  two’s  complement  system  for  the  problem  of  computing  Discrete  Fourier 
Transforms (DFTs) via the Winograd Fourier Transform Algorithm (WFTA). The
analysis  shows  that  in  a  side-by-side  comparison,  the  size  and  speed  advantages  of 
RNS  cannot  compensate  for  the  high  overhead  required  for  conversion  and  scaling. 
The  residue  number  system,  or  RNS,  is  a  system  of  representing  integers  by  their 
remainders, or residues, after division by a predetermined set of relatively prime
integers. Operations such as addition, subtraction, and multiplication can be performed
with modular arithmetic on these residues in independent channels, making it
a carry-free system. RNS can exploit efficient ROM layouts by building arithmetic
units out of ROMs. The main disadvantage of RNS is that scaling and conversion
back from RNS require a lengthy series of operations. The WFTA is found to be the
best algorithm for doing DFTs in RNS because it reduces the need for scaling by
nesting the necessary multiplications into one layer, such that there is only one layer
of coefficients that increase the range of the output data.

An  area-time  metric  for  a  custom  VLSI  layout  is  used  to  compare  RNS  adders  and 
multipliers  to  conventional  binary  components.  It  is  shown  that  in  a  strict  side-by- 
side  comparison,  with  no  rounding  allowed,  RNS  can  outperform  two’s  complement 
significantly. However, when RNS is compared against a two’s complement system
not constrained by the RNS requirement to avoid scaling, the two’s complement system can
use rounding arithmetic to beat RNS. Since the WFTA was found to be the most suitable
algorithm for doing DFTs in the RNS, it is concluded that RNS does not provide a real
advantage  for  doing  DFTs.  A  new  system  is  proposed,  in  which  DFTs  are  performed 
using distributed arithmetic with ROM lookup tables.

Chapter 1
Introduction 


Computing  the  Discrete  Fourier  Transform  (DFT)  of  a  data  sequence  is  a  common 
operation in digital signal processing. However, the time required to perform
transforms can be a significant bottleneck in applications where there is a large number
of data points, such as in image processing and radar signal processing. There exist
many algorithms for computing DFTs, the most common being the Cooley-Tukey
Fast Fourier Transform (FFT) algorithm. Since its introduction, however, several other
algorithms have been developed that reduce the number of computations required, but
with a less regular structure than the FFT. The most prominent of these are the
Good-Winograd  Fourier  Transform  Algorithm  (GWFTA,  also  known  as  the  Prime 
Factor  Algorithm)  and  the  Winograd  Fourier  Transform  Algorithm  (WFTA),  which 
reduce  the  number  of  multiplies  required  in  the  transform  [3]. 

There  has  likewise  been  a  great  deal  of  research  done  on  specialized  high-speed 
architectures  for  performing  these  computations  [18,  1,  20,  9,  2].  Specifically,  there  has 
been  some  interest  in  doing  these  calculations  in  the  residue  number  system  (RNS), 
in  which  integers  are  represented  by  their  value  modulo  a  predetermined  set  of  small 
integers  [5,  6,  19].  This  number  system  has  been  shown  to  have  properties  that  result 
in  significant  speed  advantages  in  digital  signal  processing  applications,  where  most 
of the computations required are multiplications and additions.

This thesis will analyze and compare RNS and two’s complement systems in terms
of size and speed for a custom layout to determine if RNS is inherently better for solving
the DFT problem. Previous work has emphasized the speed advantage of RNS
over conventional binary arithmetic, ignoring the increased speed possible in conventional
binary arithmetic through the use of pipelined or parallel VLSI components.
Therefore,  an  area-time  product  is  a  better  metric  for  comparison  and  will  be  used  in 
this  thesis.  This  chapter  will  introduce  the  residue  number  system,  and  explain  why 
it  may  be  better  for  performing  DFTs  and  what  its  disadvantages  are.  Chapter  2 
explains  in  more  detail  the  requirements  for  doing  arithmetic  in  the  RNS  and  the 
parts  of  such  a  system.  Chapter  3  discusses  the  choice  of  using  an  area-time  product 
to  evaluate  computational  systems,  and  it  compares  two’s  complement  components 
to RNS components in order to determine the relative advantage of RNS at the
adder and multiplier level. Chapter 4 explains the three most efficient algorithms for
computing DFTs and describes how the WFTA is the most suitable algorithm for RNS.
Chapter  5  uses  the  area-time  metric  to  compare  two’s  complement  and  RNS  systems 
designed  for  doing  different  size  WFTAs.  The  conclusions,  summed  up  in  Chapter  6, 
are  that  RNS  does  not  provide  a  clear  advantage,  in  terms  of  the  area-time  metric, 
over  two’s  complement  for  the  DFT  problem.  Other  approaches  to  the  problem  and 
applications  of  the  RNS  are  suggested  for  further  research. 


1.1  The  RNS  in  Digital  Signal  Processing 

The  residue  number  system,  or  RNS,  is  an  integer  coding  system  which  has  been 
shown  to  have  potential  speed  advantages  in  systems  designed  for  digital  signal 
processing.  A  residue  system  represents  an  integer  by  a  number  of  remainders,  or 
residues,  after  division  of  the  integer  by  a  set  of  given  integers.  This  is  also  called 
modular  arithmetic  because  the  remainder  is  the  representation  of  the  integer  in  a 
modulo n system, where n is the number by which the integer is divided. For
example, 23 could be represented by 2, 3, and 2, which result after division by 3, 5, and
7,  respectively.  RNS  arithmetic  is  not  new;  in  fact,  the  previous  example  is  taken 
from  Suan-ching,  written  by  Sun  Tzu  in  the  third  century  [16].  His  work  describes 
a  method  for  converting  the  residues  back  to  the  integer.  Today,  this  procedure  is 
appropriately  termed  the  Chinese  Remainder  Theorem. 

The potential advantage of RNS-based digital systems is that once integers are
represented  by  their  remainders  modulo  a  given  set  of  integers,  operations  such  as 
addition,  subtraction,  and  multiplication  can  be  performed  independently  and  in 
parallel  on  the  different  remainders.  This  arithmetic  is  done  modulo  n,  where  n  is 
the  integer  upon  which  the  corresponding  residues  are  based.  Since  a  large  integer 
can  be  represented  by  several  smaller  residues,  and  the  arithmetic  operations  on  these 
residues  can  be  performed  independently  and  in  parallel,  the  RNS  has  potential  speed 
advantages  because  the  residue  digits  are  smaller  yet  they  require  no  carries  between 
them.  The  speed  advantages  are  most  promising  in  digital  signal  processing  where 
the  RNS-compatible  operations  of  addition,  subtraction,  and  multiplication  are  most 
common. 

A residue number system is defined by a set of integers $\{m_1, m_2, \ldots, m_L\}$ which
are pairwise relatively prime. (The greatest common factor of any two is 1.) Any
integer $X$ in the range $[0, M-1]$, where $M = m_1 \cdot m_2 \cdots m_L$, can be uniquely
represented by the set of residues $\{x_1, x_2, \ldots, x_L\}$, where

$$x_i = X \bmod m_i = |X|_{m_i}, \quad i = 1, \ldots, L. \tag{1.1}$$

Soderstrand et al. explain that for a signed number system, the legitimate range
$[0, M-1]$ is divided into positive and negative regions. For $M$ odd, the permitted
range is $[-(M-1)/2, (M-1)/2]$ with negative integers mapped to
$[(M+1)/2, M-1]$ of the legitimate range above. Likewise, for $M$ even, the permitted
range is $[-M/2, M/2 - 1]$ with negatives mapped to $[M/2, M-1]$ of the legitimate
range [12].
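
The encoding just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the moduli (3, 5, 7) are simply those of the earlier example for 23.

```python
from math import gcd, prod

def pairwise_coprime(moduli):
    """True if every pair of moduli has greatest common factor 1."""
    return all(gcd(a, b) == 1
               for i, a in enumerate(moduli) for b in moduli[i + 1:])

def to_rns(x, moduli):
    """Residue digits x_i = |X|_{m_i} (Python's % returns a value in [0, m))."""
    return tuple(x % m for m in moduli)

def to_legitimate(x, M):
    """Map a signed integer in the permitted range onto [0, M-1]."""
    if M % 2 == 0:
        lo, hi = -(M // 2), M // 2 - 1
    else:
        lo, hi = -((M - 1) // 2), (M - 1) // 2
    if not lo <= x <= hi:
        raise ValueError("value outside the permitted range")
    return x % M

def from_legitimate(u, M):
    """Inverse mapping: the upper part of [0, M-1] holds the negatives."""
    threshold = M // 2 if M % 2 == 0 else (M + 1) // 2
    return u - M if u >= threshold else u

moduli = (3, 5, 7)
M = prod(moduli)                       # dynamic range, 105
print(pairwise_coprime(moduli))        # True
print(to_rns(23, moduli))              # (2, 3, 2)
print(to_rns(to_legitimate(-4, M), moduli))
```

Recovering the integer from its residues requires one of the conversion methods discussed later in this chapter.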

Arithmetic operations in RNS are very simple for addition, subtraction, and
multiplication. If $X$, $Y$, and $Z$ are represented in the RNS by

$$X \Rightarrow \{x_1, \ldots, x_L\}$$

$$Y \Rightarrow \{y_1, \ldots, y_L\}$$

$$Z \Rightarrow \{z_1, \ldots, z_L\}$$

then the operation $X \ast Y = Z$ maps to

$$(x_i \ast y_i) \bmod m_i = z_i, \quad i = 1, \ldots, L,$$

where  *  represents  addition,  subtraction,  or  multiplication.  General  division  is  not 
possible  because  the  set  of  integers  is  not  closed  over  division.  Scaling  a  number  in 
the  RNS  will  be  discussed  later.  The  residue  digit  results  alone  are  not  very  useful 
until  they  are  converted  back  to  an  integer.  There  are  two  well  known  techniques 
for  converting  the  residue  digits  back  to  an  integer.  They  both  involve  quite  a  few 
operations  and  will  also  be  discussed  later. 
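
As a minimal sketch of this channel-wise arithmetic, assuming the illustrative moduli (3, 5, 7) and hypothetical helper names, each residue digit can be processed without reference to the others:

```python
import operator

MODULI = (3, 5, 7)   # illustrative moduli; dynamic range M = 105

def to_rns(x, moduli=MODULI):
    """Residue digits of x for each modulus."""
    return tuple(x % m for m in moduli)

def rns_op(op, xs, ys, moduli=MODULI):
    """Apply op (add, sub, or mul) independently in every channel.

    No information passes between channels, so there are no carries.
    """
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, moduli))

x, y = to_rns(23), to_rns(4)
print(rns_op(operator.add, x, y) == to_rns(23 + 4))   # True
print(rns_op(operator.sub, x, y) == to_rns(23 - 4))   # True
print(rns_op(operator.mul, x, y) == to_rns(23 * 4))   # True
```

The comparisons hold only while the true result stays inside the dynamic range $[0, M-1]$; beyond that the residues wrap around modulo $M$.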

Techniques for implementing RNS-based systems have been studied in great
detail [12]. There are generally three stages: converting data to its RNS equivalent,
performing the necessary computations to process the data, and converting the
results back to radix-2 binary numbers. The first operation, converting to a residue
representation,  is  discussed  by  Jenkins  in  his  description  of  an  RNS  digital  filter.  The 
conversion  is  done  most  easily  with  a  ROM  lookup  table  for  each  modulus  [7].  The 
size  of  the  table  is  within  reason  because  the  number  of  possible  entries  is  limited  to 
the  size  of  the  modulus. 

Many systems have been designed for implementing the two basic RNS
operations of addition and multiplication [12, Part III]. One of the simplest ways of doing
addition is again to use a memory lookup table, where the address is formed by the
concatenation of the two residues. The memory will have $2^{2n}$ words with $n$ bits per
word, where $n$ is the number of bits in each residue. Soderstrand has explained how
the size of the memory can be reduced if the addition is done by first adding the
residue digits with a binary adder and then using a memory with an $(n+1)$-bit address
to correct the result for the given modulus. Other systems eliminate the need for the
extra memory hardware by using normal binary adders and a correction factor to get
the RNS result [11]. This is done most easily if moduli are of the form $2^n$, $2^n + 1$, and
$2^n - 1$ because of their proximity to a power of 2.
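
The adder styles described above can be modeled in software as follows. This is a behavioral sketch only, not the circuits of [11] or [12]: the function names are hypothetical, and the end-around-carry adder for $2^n - 1$ is a standard technique assumed here as one example of a binary adder with a correction step.

```python
def build_add_table(m, n):
    """Lookup-table adder: 2^(2n) words, address = two n-bit residues concatenated."""
    table = [0] * (1 << (2 * n))
    for x in range(m):
        for y in range(m):
            table[(x << n) | y] = (x + y) % m
    return table

def add_with_correction(x, y, m):
    """Binary add (an n+1 bit sum) followed by a single conditional correction."""
    s = x + y
    return s - m if s >= m else s

def add_mod_2n_minus_1(x, y, n):
    """Adder for m = 2^n - 1: the carry out of bit n is added back in."""
    mask = (1 << n) - 1
    s = x + y
    s = (s & mask) + (s >> n)      # end-around carry
    return 0 if s == mask else s   # 2^n - 1 is congruent to 0

n, m = 3, 7                         # m = 2^3 - 1
table = build_add_table(m, n)
print(table[(5 << n) | 6], add_with_correction(5, 6, m), add_mod_2n_minus_1(5, 6, n))
```

All three routes give the same residue; they differ only in how much memory or adder logic a hardware version would need.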

Multiplication  in  RNS  can  be  just  as  simple  as  addition.  If  lookup  tables  are 
used  for  addition,  they  can  also  be  used  for  multiplication,  so  that  the  necessary 
hardware  and  speed  will  be  the  same  as  for  addition.  Soderstrand  [13]  and  Jullien  [8] 
present  two  other  approaches,  which  decrease  the  size  of  the  memory  lookup  tables 
in  exchange  for  some  additions.  A  fourth  way  of  doing  multiplication  is  a  more 
conventional  bitwise  multiplication.  The  multiplicand  is  multiplied  successively  by 
each  bit  of  the  multiplier  and  the  result  is  doubled  and  added  to  the  result  from  the 
next least significant bit. This method requires modular adders and, therefore, is
simpler if the moduli are of the form $2^n$, $2^n + 1$, and $2^n - 1$.
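
The bitwise method can be sketched as follows, assuming the multiplier bits are processed from most significant to least significant. The function name is hypothetical; the point of the sketch is that only modular additions are needed.

```python
def mul_mod_bitwise(x, y, m):
    """Multiply residues x and y modulo m using only modular additions.

    The running result is doubled (a modular add of the value to itself)
    and the multiplicand is added whenever the current multiplier bit is 1.
    """
    acc = 0
    for bit in bin(y)[2:]:          # multiplier bits, most significant first
        acc = (acc + acc) % m       # doubling step
        if bit == "1":
            acc = (acc + x) % m     # add the multiplicand
    return acc

print(mul_mod_bitwise(5, 6, 7))     # 2, since 30 mod 7 = 2
```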

Two  well-known  methods  exist  for  converting  residues  back  to  integers.  Szabo 
and  Tanaka  describe  them  in  their  comprehensive  text  on  RNS  [14].  The  first  is  the 
Chinese  Remainder  Theorem: 

$$X = \left( \sum_{i=1}^{L} \hat{m}_i \left| \frac{x_i}{\hat{m}_i} \right|_{m_i} \right) \bmod M, \quad \text{where } \hat{m}_i = M/m_i. \tag{1.2}$$

Although  this  is  the  shorter  of  the  two,  it  requires  the  use  of  a  modulo-M  adder, 
where  M  is  the  dynamic  range.  The  other  method  is  to  do  a  mixed-radix  conversion 
on  the  RNS  digits.  An  integer  is  represented  in  the  mixed-radix  system  by 

$$X = a_L \prod_{i=1}^{L-1} m_i + \cdots + a_3 m_1 m_2 + a_2 m_1 + a_1, \tag{1.3}$$

where the $a_i$'s are called the mixed-radix coefficients. These coefficients can be found
through a sequence of $L - 1$ subtractions and $L - 1$ multiplications on the original
residues.  The  integer  X  is  found  by  adding  the  terms  in  the  above  expression.  Any 
necessary  division  or  scaling  in  the  RNS  can  also  be  performed  by  converting  to  the 
mixed-radix form since it is a weighted number representation.

Scaling is a problem which must be dealt with in RNS systems because of the
potential  problem  of  exceeding  the  dynamic  range  of  a  system.  The  problem  is  that  in 
RNS,  the  magnitude  of  an  integer  depends  on  all  its  residues  together,  so  each  residue 
digit  cannot  be  scaled  independently.  Scaling  is  different  from  general  division  because 
it  implies  division  by  a  given  constant  and  rounding.  Szabo  and  Tanaka  describe  the 
procedure  most  commonly  used,  which  is  to  scale  by  one  of  the  moduli  [14].  In 
order  to  ensure  that  the  rounded  division  will  result  in  an  integer,  the  remainder 
for  the  modulus  by  which  the  integer  is  being  scaled  is  subtracted  from  the  RNS 
representation. The RNS digits are then multiplied by the multiplicative inverse of the
scaling modulus and a base extension algorithm, similar to mixed-radix conversion, is
used to determine the new residue for the scaling modulus. This algorithm requires
significantly fewer computations if the moduli are of the popular $2^n$, $2^n + 1$, and $2^n - 1$
form.

The  speed  advantages  of  RNS  arithmetic  extend  from  the  fact  that  operations  on 
the  residues  can  be  carried  out  independently.  Since  the  residue  digits  are  limited  by 
the  size  of  the  moduli  and  there  is  no  carry  information  sent  from  one  residue  digit 
to  another,  residue  systems  should  perform  faster  than  conventional  radix-2  systems 
due  to  the  reduction  of  carry  propagations.  The  drawbacks  are  that 

1.  general division is not possible since the set of integers is not closed under division,

2.  scaling,  a  more  specific  type  of  division,  is  very  inefficient,  and 

3.  some  time  must  be  invested  in  converting  integers  to  and  from  the  residue 
system. 

Hence,  the  speed  advantage  of  RNS  is  most  promising  when  the  number  of  simple 
arithmetic  operations  required  by  the  application  algorithm  is  large  with  respect  to 
any required scaling or conversion operations.


1.2 Computing DFTs with RNS Arithmetic


Using  RNS  arithmetic  to  compute  DFTs  raises  issues  that  are  not  so  great  a  concern 
with  a  radix-2  weighted  number  system.  The  greatest  of  these  issues  is  the  problem  of 
scaling,  because  RNS  is  a  system  for  encoding  integers.  Any  DFT  algorithm  requires 
the multiplication of the input data by twiddle factors of the form $e^{-j 2\pi nk/N}$, whose
real and imaginary parts are a cosine and a sine. The magnitudes of these values never
exceed 1, so encoding them in RNS requires scaling them up by a factor determined by the
number of bits of desired resolution. With the twiddle factors all represented by integers greater
than 1, the dynamic range grows with the transform size. A residue system would
therefore require a dynamic range larger than that of the original data to allow for
integer growth before scaling. This is not a problem in a radix-2 system because any
desired scaling and truncation can be done by shifting the binary point and throwing
away the least significant digits to make room for the digits with a greater weight. As
explained  in  the  previous  section,  however,  scaling  RNS  values  requires  a  significant 
number  of  computations. 
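
The growth in dynamic range can be illustrated with a small sketch. It assumes, purely for illustration, that the twiddle factors are scaled by a power of two and rounded to integers; the function names and the chosen bit widths are hypothetical and are not taken from the designs analyzed later.

```python
import math

def quantize_twiddles(N, bits):
    """Integer-scaled (real, imaginary) parts of e^{-j 2 pi k / N}, k = 0..N-1."""
    scale = 1 << bits
    return [(round(math.cos(2 * math.pi * k / N) * scale),
             round(-math.sin(2 * math.pi * k / N) * scale))
            for k in range(N)]

def product_bits(data_bits, coeff_bits):
    """Magnitude bits needed to hold data * coefficient with no rounding."""
    max_data = (1 << data_bits) - 1
    max_coeff = 1 << coeff_bits
    return (max_data * max_coeff).bit_length()

print(quantize_twiddles(4, 8))   # [(256, 0), (0, -256), (-256, 0), (0, 256)]
print(product_bits(12, 8))       # 20
```

With 12-bit data magnitudes and 8-bit coefficients, a single layer of multiplications already needs 20 magnitude bits, and every further layer of coefficients would add more unless the data is scaled in between.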

The  size  of  this  scaling  problem  depends  on  the  actual  DFT  algorithm  and  moduli 
set  being  used.  Taylor  has  shown  that  the  WFTA  would  be  most  suitable  for  RNS 
because  it  reduces  the  number  of  multiplies  to  a  minimum  and,  hence,  reduces  the 
number of scaling operations [17]. Whether or not the RNS can improve the
area-time product of DFT calculations is a question that will be addressed in this thesis.

There  are  two  main  problems  with  determining  whether  or  not  an  RNS-based 
design  can  outperform  a  similar  system  using  conventional  arithmetic.  The  first  is 
that  there  is  no  standard,  optimal  design  for  a  system  that  computes  DFTs.  Most 
of  these  specialized  hardware  designs  are  highly  pipelined,  but  other  than  that,  they 
use  many  different  architectures  and  algorithms  depending  on  the  application  and  the 
desired  performance.  This  leads  to  the  second  problem,  in  that  there  is  no  typical 
application  with  a  set  of  requirements  that  is  representative  of  all  systems  requiring 
the use of DFTs. DFTs vary in the size of the transform and in the desired accuracy
and speed of the computation. Accordingly, a system can be designed to optimize
its  performance  for  a  given  application,  even  though  it  may  be  quite  inefficient  for 
another. 

For  these  reasons,  the  first  goal  of  this  thesis  is  to  show  that  the  area-time  metric 
represents  the  most  significant  constraints  on  systems  used  to  compute  DFTs.  These 
constraints  include  speed,  size  (in  terms  of  silicon  area),  and  power  as  a  function  of 
the  transform  length  and  the  desired  accuracy.  The  performance  of  the  best  algorithm 
and  hardware  implementation  in  a  standard  technology  base  will  be  evaluated  using 
this  metric  to  determine  performance  characteristics  of  specific  DFTs  implemented 
with an RNS architecture.

Chapter 2


Using  the  RNS  in  Digital  Signal 
Processing 


This  chapter  will  expand  on  the  requirements  for  doing  digital  signal  processing  in 
the  residue  number  system.  As  explained  in  Chapter  1,  there  are  three  parts  to  an 
RNS  system: 

1.  conversion  of  data  from  two’s  complement  or  other  binary  format  to  RNS, 

2.  processing  the  data  in  the  RNS,  and 

3.  conversion of the data from RNS to the original binary format.

These  three  parts  of  the  problem  will  be  addressed  before  examining  how  the  RNS 
compares  to  conventional  two’s  complement  arithmetic. 


2.1  Conversion  From  Two’s  Complement  to  RNS 

The  first  part  of  the  processing  in  RNS  is  to  calculate  the  residue  representation  of 
each  piece  of  data  for  the  set  of  moduli  being  used.  Since  the  ultimate  goal  of  using 
the RNS is to increase the speed of a system, this conversion must be done as quickly
as possible. Jenkins discusses a common and efficient solution, which is to use
high-speed ROMs to look up the residues for each modulus independently. For relatively
small moduli, each of these ROMs would be of a manageable size [7]. If we let $b$ be
the number of bits in the original integer $X$, and we wish to determine $x_i = |X|_{m_i}$,
where $m_i$ has $b_i$ bits, then the ROM must be of size $2^b \times b_i$ bits. Although $b_i$ can
be kept relatively small (3-7 bits), the original dynamic range determined by $b$ may
be large, requiring a very large ROM with many small words. A practical solution to
this problem is to divide it into smaller pieces. The $b$-bit number may be
divided into several smaller numbers, such as the least significant and most significant
$b/2$ bits. These smaller numbers can be looked up in ROMs independently, weighted
by the proper powers of two, and then added together in a modulo-$m_i$ adder. Techniques
for modular addition will be discussed in the next chapter.
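
The split-ROM idea can be sketched as follows for an unsigned $b$-bit input with $b$ even. The function names are hypothetical, and folding the power-of-two weight into the stored ROM contents is an assumption made here to keep the example short.

```python
def build_half_roms(m, b):
    """Two ROMs of 2^(b/2) words each, replacing one ROM of 2^b words."""
    h = b // 2
    low_rom = [v % m for v in range(1 << h)]
    # The upper half carries weight 2^h; the weight is folded into the ROM.
    high_rom = [(v << h) % m for v in range(1 << h)]
    return low_rom, high_rom

def residue_via_roms(x, m, b, roms):
    """Look up both halves of x and combine them in a modulo-m adder."""
    h = b // 2
    low_rom, high_rom = roms
    lo = x & ((1 << h) - 1)
    hi = x >> h
    return (low_rom[lo] + high_rom[hi]) % m

roms = build_half_roms(7, 16)
print(residue_via_roms(50000, 7, 16, roms), 50000 % 7)   # both print the same residue
```

For $b = 16$, the single table would need 65536 words per modulus, while the two half tables need 256 words each plus one modular addition.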


2.2  RNS  Processing 

The processing of data in the RNS parallels the processing of data for the same
problem in two's complement notation, for the RNS-compatible operations of addition,
subtraction,  and  multiplication.  Hardware  implementation  of  these  operations  will 
be  discussed  in  the  following  chapter. 

Scaling  is  a  process  that  must  be  treated  much  differently  in  the  RNS  than  in 
two's  complement.  Whereas  two’s  complement  data  may  be  scaled  by  shifting  bits, 
one  place  value  for  each  factor  of  two,  scaling  a  number  in  the  RNS  requires  a  lengthy 
series  of  operations.  Scaling  is  different  from  general  division  in  that  data  is  divided 
by  a  predetermined  constant  and  then  rounded.  Szabo  and  Tanaka  explain  the  most 
efficient  type  of  scaling,  in  which  the  data  is  scaled  by  one  or  more  moduli.  The  first 
step  in  the  scaling  process  is  to  round  the  number  to  a  multiple  of  the  intended  scaling 
factor. Let $X$ be the integer to be scaled, and let $Y$ be the scaling factor. If we round
$X$ down to a multiple of $Y$, then the scaling problem is reduced to computing

$$\frac{X - |X|_Y}{Y}. \tag{2.1}$$

Since  Y  will  be  one  of  the  moduli  or  the  product  of  several,  the  existence  of  its 
multiplicative  inverse  modulo  the  other  moduli  is  guaranteed  by  the  requirement 
that the moduli be relatively prime. The residues of $X - |X|_Y$ are therefore multiplied by
$|1/Y|_{m_i}$ within each of the RNS channels. The final step in the scaling problem is to
perform  a  base  extension  to  determine  the  residues  of  the  scaled  result  for  the  moduli 
by  which  X  was  scaled.  The  base  extension  is  similar  to  a  mixed-radix  conversion  on 
the  moduli  by  which  X  was  not  scaled;  the  mixed-radix  conversion  will  be  discussed 
in  the  next  section  [14]. 

The  RNS  scaling  process  will  be  demonstrated  by  an  example,  taken  from  Szabo 
and  Tanaka,  in  which  a  positive  integer  is  scaled  by  two  moduli.  Let  the  moduli  be 
$m_1 = 2$, $m_2 = 3$, $m_3 = 5$, and $m_4 = 7$. The integer

$$X = 89 \rightarrow \{1, 2, 4, 5\}$$

will be scaled by $15 = 3 \cdot 5$, yielding the result $z = \lfloor X/15 \rfloor$. The solution is shown below.

Moduli:                                        2    3    5    7

Residue representation of $X$                  1    2    4    5
Subtract $|X|_3 = 2$                           0    2    2    2
  Result: $X - |X|_3$ (moduli 2, 3, 5, 7)      1    0    2    3
Multiply by $|1/3|_{m_i}$                      1    -    2    5
  Result (moduli 2, 5, 7)                      1    -    4    1
Subtract $|X|_5 = 4$                           0    -    4    4
  Result (moduli 2, 5, 7)                      1    -    0    4
Multiply by $|1/5|_{m_i}$                      1    -    -    3
  Result (moduli 2, 7), with 0 entered in      1    0    0    5
  the missing columns for base extension
Subtract 1                                     1    1    1    1
  Result                                       0    2    4    4
Multiply by $|1/2|_{m_i}$                      -    2    3    4
  Result                                       -    1    2    2
Subtract 2                                     -    2    2    2
  Result                                       -    2    0    0

Then $\left| \, |\tfrac{1}{2} z|_3 + 2 \, \right|_3 = 0$ and $\left| \, |\tfrac{1}{2} z|_5 + 0 \, \right|_5 = 0$; hence,
$|z|_3 = 2$ and $|z|_5 = 0$.

Therefore, the residue representation of the scaled result $z$ is $\{1, 2, 0, 5\} \leftrightarrow 5$.


This example shows that it takes $L$ multiplies and $L$ additions (in each of the RNS
channels) to scale a number in the RNS, not including the final subtraction and
multiplication required to determine the final two residues of $z$. Remember that $L$ is
the number of moduli in the system [14].

There are several observations that should be made concerning this scaling
algorithm. The first is that the number of operations is independent of the number
of moduli by which the integer is being scaled, except for the final subtraction and
multiplication to solve for $z$, which may be done in parallel anyway. All of the
operations are also performed within the separate RNS modular arithmetic channels. The
process  is  complicated  slightly  if  we  are  interested  in  rounding  to  the  nearest  integer 
instead  of  simply  rounding  down  and  if  we  need  to  scale  negative  integers  as  well  as 
positive  ones.  These  cases  are  explained  in  more  detail  by  Szabo  and  Tanaka  [14]. 
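
The scaling procedure can be sketched in Python for the simple case treated above: a non-negative integer, rounded down. The function name is hypothetical. Note one deliberate difference from the worked example: instead of entering zeros in the missing columns, this sketch finds the mixed-radix digits of $z$ over the remaining moduli and evaluates them in the dropped channels, which yields the same residues. It relies on Python 3.8 or later for `pow(a, -1, m)`.

```python
def scale_rns(residues, moduli, scale_by):
    """Divide X by the product of the moduli in scale_by, rounding down.

    Returns the residues of the scaled result z for all of the moduli.
    """
    digits = dict(zip(moduli, residues))
    for s in scale_by:
        r = digits.pop(s)                    # residue for the scaling modulus
        for m in digits:
            # subtract r, then multiply by |1/s|_m inside channel m
            digits[m] = ((digits[m] - r) * pow(s, -1, m)) % m

    # Base extension: mixed-radix digits of z over the remaining moduli.
    remaining = list(digits)
    work = dict(digits)
    coeffs = []
    for j, mj in enumerate(remaining):
        a = work[mj]
        coeffs.append(a)
        for mk in remaining[j + 1:]:
            work[mk] = ((work[mk] - a) * pow(mj, -1, mk)) % mk

    # Evaluate the weighted sum in each channel that was scaled away.
    for s in scale_by:
        value, weight = 0, 1
        for a, mj in zip(coeffs, remaining):
            value = (value + a * weight) % s
            weight = (weight * mj) % s
        digits[s] = value
    return tuple(digits[m] for m in moduli)

print(scale_rns((1, 2, 4, 5), (2, 3, 5, 7), (3, 5)))   # (1, 2, 0, 5)
```

The printed residues match the result of the example, $\{1, 2, 0, 5\}$, which represents 5.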

2.3  Conversion  From  RNS  to  Two’s  Complement 

There  are  two  methods  for  converting  an  integer  from  the  RNS  to  a  conventional  two’s 
complement  format.  These  techniques  are  called  the  Chinese  Remainder  Theorem 
and  the  mixed-radix  conversion  and  were  presented  in  Chapter  1.  Examples  of  these 
conversion  methods  will  be  given  to  help  explain  the  processes. 

The  Chinese  Remainder  Theorem  (CRT)  is  the  classical  conversion  algorithm. 
The  equation  from  Chapter  1  is  repeated  here: 

$$X = \left( \sum_{i=1}^{L} \hat{m}_i \left| \frac{x_i}{\hat{m}_i} \right|_{m_i} \right) \bmod M, \quad \text{where } \hat{m}_i = M/m_i. \tag{2.2}$$

The following example of the CRT is taken from a tutorial by Taylor [16]. Let
$m_1 = 3$, $m_2 = 4$, and $m_3 = 5$, such that $M = 60$. The problem will be to convert
$X = \{1, 0, 4\}$ back to an integer. The values of $\hat{m}_i$ and their multiplicative inverses can
be precalculated:

$$\hat{m}_1 = 20, \quad \hat{m}_1^{-1} = 2$$

$$\hat{m}_2 = 15, \quad \hat{m}_2^{-1} = 3$$

$$\hat{m}_3 = 12, \quad \hat{m}_3^{-1} = 3.$$

The solution is

$$X = \left| (20 \cdot |1 \cdot 2|_3) + (15 \cdot |0 \cdot 3|_4) + (12 \cdot |4 \cdot 3|_5) \right|_{60} = 4.$$

The CRT requires one layer of modular multiplications and then a sequence of $L$
multiplications and $L - 1$ adds modulo $M$. The requirement for a modulo-$M$ adder
is the main disadvantage of the CRT. An RNS system will have components to do
modular arithmetic for the factors of $M$, but not for $M$ itself.
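
The CRT conversion can be written compactly in Python. This is a sketch with a hypothetical function name; it uses `pow(a, -1, m)` (Python 3.8 or later) for the multiplicative inverses rather than precomputed constants.

```python
from math import prod

def crt(residues, moduli):
    """Convert residue digits back to the integer in [0, M-1]."""
    M = prod(moduli)
    total = 0
    for x, m in zip(residues, moduli):
        m_hat = M // m                     # product of all the other moduli
        inv = pow(m_hat, -1, m)            # multiplicative inverse of m_hat mod m
        # inner product is modulo m; the accumulation is modulo M
        total = (total + m_hat * ((x * inv) % m)) % M
    return total

print(crt((1, 0, 4), (3, 4, 5)))   # 4, as in the example above
```

The accumulation line is the modulo-$M$ addition that a hardware implementation would need a dedicated adder for.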

The  equation  for  the  mixed-radix  conversion  (MRC)  is 


$$X = a_L \prod_{i=1}^{L-1} m_i + \cdots + a_3 m_1 m_2 + a_2 m_1 + a_1. \tag{2.3}$$

The coefficients of the form $a_i$ are called the mixed-radix coefficients, and each one
may take on values of $0, \ldots, m_i - 1$. The mixed-radix representation of a number is
a weighted number system because the value of the integer is a weighted sum of the
coefficients. This representation can therefore be used for magnitude comparison and
sign detection when negative integers are encoded. Szabo and Tanaka explain how
these coefficients are found through a series of nested subtractions and divisions. If
Equation 2.3 is first evaluated modulo $m_1$, it is seen that

$$|X|_{m_1} = a_1,$$

so that $a_1$ is simply the first residue digit. The next coefficient is found by noting
that

$$\left| \frac{X - a_1}{m_1} \right|_{m_2} = a_2.$$

The division by $m_1$ is possible because its multiplicative inverse exists for all the other
moduli, by virtue of the fact that the moduli are selected to be relatively prime. The
remaining  mixed-radix  coefficients  can  be  found  by  repeating  this  subtraction  and 
division, as demonstrated in the following example from Szabo and Tanaka.

Moduli:                                      8          5          7          3

Residue representation of $X$                3 = a_1    4          2          1
Subtract $a_1 = 3$                           3          3          3          0
  Result: $X - a_1$ (moduli 8, 5, 7, 3)      0          1          6          1
Multiply by $|1/8|_{m_i}$                    -          2          1          2
  Result (moduli 5, 7, 3)                    -          2 = a_2    6          2
Subtract $a_2 = 2$                           -          2          2          2
  Result (moduli 5, 7, 3)                    -          0          4          0
Multiply by $|1/5|_{m_i}$                    -          -          3          2
  Result (moduli 7, 3)                       -          -          5 = a_3    0
Subtract $a_3 = 5$                           -          -          5          2
  Result (moduli 7, 3)                       -          -          0          1
Multiply by $|1/7|_{m_i}$                    -          -          -          1
  Result                                     -          -          -          1 = a_4

Here $a_1 = |X|_{m_1} = 3$, $a_2 = |\lfloor X/m_1 \rfloor|_{m_2} = |\lfloor 499/8 \rfloor|_5 = 2$,
$a_3 = |\lfloor X/(m_1 m_2) \rfloor|_{m_3} = |\lfloor 499/40 \rfloor|_7 = 5$, and
$a_4 = |\lfloor X/(m_1 m_2 m_3) \rfloor|_{m_4} = |\lfloor 499/280 \rfloor|_3 = 1$.

The mixed-radix representation of $X$ is now $\{1, 5, 2, 3\}$. From Equation 2.3 the
integer is

$$X = 1(8 \cdot 5 \cdot 7) + 5(8 \cdot 5) + 2(8) + 3(1) = 499.$$

The mixed-radix conversion process requires $L - 1$ layers of subtractions and
$L - 1$ layers of multiplies within the RNS components. This is a count of the layers of
operations because the individual RNS channels will perform operations in parallel.
Conventional arithmetic is then required to multiply each coefficient by its weight
and add the components ($L - 1$ multiplies and the same number of adds). While the
operation count is higher than for the CRT, most of the arithmetic is done within the
RNS channels and no modulo-$M$ adder is required.
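
The mixed-radix conversion can be sketched in the same style. The function names are hypothetical, and `pow(a, -1, m)` (Python 3.8 or later) stands in for the precomputed inverses a hardware design would store.

```python
def mixed_radix_digits(residues, moduli):
    """Return [a_1, ..., a_L] using only channel-wise subtract and multiply."""
    work = list(residues)
    digits = []
    for j, mj in enumerate(moduli):
        a = work[j]                        # next coefficient is the current residue
        digits.append(a)
        for k in range(j + 1, len(moduli)):
            mk = moduli[k]
            # subtract a, then multiply by |1/m_j|_{m_k} in channel k
            work[k] = ((work[k] - a) * pow(mj, -1, mk)) % mk
    return digits

def mixed_radix_to_int(digits, moduli):
    """Weighted sum of Equation 2.3, done in conventional arithmetic."""
    value, weight = 0, 1
    for a, m in zip(digits, moduli):
        value += a * weight
        weight *= m
    return value

moduli = (8, 5, 7, 3)
digits = mixed_radix_digits((3, 4, 2, 1), moduli)
print(digits)                              # [3, 2, 5, 1], i.e. a_1 through a_4
print(mixed_radix_to_int(digits, moduli))  # 499
```

The first function corresponds to the layers performed inside the RNS channels; only the second requires conventional multiplies and adds.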
Chapter 3


Performance  of  RNS  Hardware 


In this chapter, a set of performance measures for analysis of architectures for
computing DFTs will be established. These measures will be applied to the basic hardware
components  used  in  computing  DFTs  to  determine  the  performance  advantage  of 
RNS  hardware  at  the  building  block  level.  These  performance  measures  will  then 
be  used  in  the  next  chapter  as  the  basis  for  a  comparison  between  binary  weighted 
number  and  RNS  architectures  for  computing  entire  DFTs. 


3.1  Performance  Measures 

Some  primary  concerns  of  a  system  designer  for  signal  processing  are  speed,  size,  cost, 
and  power.  The  last  three  are  directly  related  to  each  other  for  a  given  technology 
base,  so  cost  and  power  can  be  reasonably  predicted  given  some  measure  of  size.  Speed 
is the other main concern and must be defined such that it reflects only an architecture's
ability  to  solve  a  problem  quickly  and  not  parameters  such  as  the  clocking  scheme  or 
operation  definition.  Since  the  goal  is  to  compare  design  architectures,  we  can  choose 
either  high-level  measures  (gate  delay  and  gate  count)  or  low-level  measures  (such 
as  silicon  area  and  actual  time)  for  a  specific  technology,  under  the  assumption  that 
these measures would scale consistently for a different technology.

3.1.1 Speed

The  primary  goal  in  measuring  speed  is  to  compare  the  impact  of  choosing  the  RNS 
over  conventional  arithmetic.  In  this  analysis,  the  same  algorithm,  architecture,  and 
building  blocks  will  be  used  to  design  two  systems  for  computing  DFTs,  except  that 
RNS components will be substituted for two's complement components in one system.
Speed will be measured by the time to compute one DFT, which shall be called
latency.  Pipelining  and  other  techniques  used  to  increase  throughput  will  not  be  used 
because  their  impact  would  be  similar  for  both  systems.  The  situation  in  which  it 
would  not  be  similar  is  when  pipeline  registers  are  placed  inside  the  arithmetic  oper¬ 
ators  instead  of  between  them.  While  allowing  data  to  pass  through  at  a  higher  rate, 
pipelining  does  not  reduce  the  actual  time  spent  performing  the  necessary  operations 
on  the  data.  In  fact,  the  latency  would  increase  due  to  the  additional  time  of  passing 
data  through  the  pipeline  registers. 

3.1.2  Size 

The  size  of  a  system  is  a  parameter  which  generally  increases  with  its  speed.  For  a 
given  technology  and  design  approach,  a  system’s  speed  can  be  increased  by  using 
more hardware, but it is desirable to improve upon a system's speed without a proportional
increase in size. Other parameters of a system that increase with size are
cost  and  power;  although  the  relationship  may  not  be  linear,  these  two  parameters 
can  be  reasonably  estimated  from  size.  Thus,  the  metrics  used  for  comparing  systems 
will  be  limited  to  speed  and  size. 

Determining  how  to  measure  size  is  a  somewhat  more  difficult  problem.  Whatever 
measure  is  used,  it  should  be  one  that  allows  a  relative  comparison  of  two  designs 
independent of the common technology used to implement them. One such measure is
gate  count.  Given  the  hardware  components  required  for  a  design  (adders,  registers, 
etc.),  a  library  of  standard  cells  can  be  used  to  determine  gate  count.  One  drawback 
to this approach is that not all hardware translates directly into gates, an example
being ROMs. ROMs are frequently used to implement arithmetic in RNS devices, so
the relative size between a ROM and other logic elements will have to be established.
The approach chosen here will be to develop an expression for the size and speed of
a ROM in terms of the size and speed of a full adder for a 3 μm CMOS technology.
A  full  adder  is  used  for  comparison  because  it  is  the  basic  computational  element  for 
addition  and  multiplication.  After  expressing  these  measures  relative  to  those  of  a 
full  adder,  conversions  may  be  made  to  units  such  as  equivalent  gate  count  and  delay. 

3.1.3  Area-Time  Metric 

Comparing  the  performance  of  two  or  more  system  designs  is  complicated  by  the  fact 
that  they  generally  have  different  sizes  and  different  times.  Sometimes  one  design 
dominates the others in the sense that it has a smaller size and is faster; more generally,
however, the designs cannot be directly ranked by both measures simultaneously.
To resolve this difficulty, we propose using an area-time product to objectively compare
designs in terms of a single metric. Combining area and time into this single
metric  assumes  that  there  is  a  tradeoff  in  area  and  time  for  any  component  such  that 
their  product  remains  approximately  constant.  The  backing  for  this  claim  is  the  idea 
that  twice  the  hardware  can  be  used  to  do  twice  as  many  operations  in  any  given 
time  interval,  thus  halving  the  time  per  operation.  This  metric  does  not  address  the 
cost  of  interconnecting  these  components,  but  it  does  reflect  inherent  space  and  time 
requirements  for  computations  performed  by  them. 

3.1.4  ROM  Layout  vs.  Full  Adder 

This  section  will  develop  a  set  of  equations  to  approximate  the  size  and  speed  of  a 
ROM relative to a full adder for a 3 μm CMOS technology.

Layout  of  a  ROM 

Figure 3.1: Construction of ROM.

The construction of a typical ROM is shown in Figure 3.1. The ROM will be analyzed
in terms of the three main parts in the figure to determine an expression for the silicon
area and time delay. These measures will depend on the geometry of the ROM; ROMs
are usually laid out in the shape of a square, as opposed to a rectangle with different
dimensions. Let $n_i$ be the number of bits in the input, or address, and $n_o$ be the
number of bits in each output word. For a ROM of $2^{n_i} \times n_o$ bits the ideal length of
each side of the memory portion would be $\sqrt{n_o 2^{n_i}}$ bits. Each column in the ROM
contains an $n_o$-bit word, so the number of columns from which the address bits must
select is

$$\frac{\sqrt{n_o 2^{n_i}}}{n_o}. \tag{3.1}$$

The number of address bits in the column selector is given by

$$c = \left[\log_2 \frac{\sqrt{n_o 2^{n_i}}}{n_o}\right], \tag{3.2}$$

so that the number of address bits in the row decoder is given by

$$r = \left[\log_2 \sqrt{n_o 2^{n_i}}\right], \tag{3.3}$$

where $[x]$ represents the nearest integer to $x$ (rounded). Care must be taken in
rounding so that $c + r = n_i$.
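The split of the address into row and column bits can be sketched in a few lines of Python. This is only an illustration: the function name `split_address_bits` is hypothetical, and the tie-breaking rule (a half-integer ideal value of c is rounded down, with r taking the remaining bits) is an assumption made here so that c + r = n_i always holds.

```python
import math

def split_address_bits(n_i, n_o):
    """Return (r, c): row and column address bits for a 2**n_i x n_o ROM."""
    # Ideal number of word columns for a square core is sqrt(n_o * 2**n_i) / n_o,
    # whose base-2 logarithm is (n_i - log2(n_o)) / 2.
    ideal_c = (n_i - math.log2(n_o)) / 2
    # Nearest integer, with ties rounded down (assumption), clamped to [0, n_i].
    c = min(max(math.ceil(ideal_c - 0.5), 0), n_i)
    r = n_i - c  # the remaining address bits go to the row decoder
    return r, c

if __name__ == "__main__":
    for n_i, n_o in [(4, 2), (8, 4), (10, 5)]:
        print(n_i, n_o, split_address_bits(n_i, n_o))
```

Computing r as the remainder, rather than rounding it independently, is one simple way to honor the caution about rounding.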


The  analysis  of  the  three  parts  of  the  ROM  is  discussed  by  Hodges  and  Jackson. 
Both  the  row  decoder  and  the  core  of  the  ROM  are  usually  designed  as  a  NOR  array, 
as  shown  in  Figure  3.2.  The  pull-up  devices  may  be  either  pMOS  precharge  or  nMOS 
depletion-mode devices [4]. For the row decoder, there are two rows of transistors for
each of the r row address bits. The row decoder has $2^r$ output lines that connect to
the  rows  of  the  ROM  core. 

The  NOR  arrays  get  their  name  because  they  operate  just  like  multiple-input 
NOR  gates.  The  output  of  each  column  in  the  array  is  the  result  of  NORing  all  the 
inputs  to  the  nMOS  devices.  These  devices  are  connected  in  parallel  to  ground,  so 
that  a  high  input  to  any  of  them  results  in  the  column  being  discharged.  Either  the 
nMOS  devices  or  the  contacts  to  them  are  placed  selectively  to  generate  the  desired 
output  for  each  possible  set  of  inputs.  For  the  row  decoder,  each  column  will  be 
a  word  select  line  for  the  ROM.  The  transistors  in  that  column  are  connected  to 
the  proper  set  of  address  input  bit  lines,  or  their  inverses,  so  that  only  one  column 
remains  high  for  each  possible  address.  The  row  decoder  is  rotated  90  degrees  so  that 
the  columns  run  horizontally;  this  allows  the  column  lines  to  be  connected  directly  to 
the  row  lines  of  the  ROM.  Only  one  row  line  in  the  ROM  will  be  high  for  any  set  of 
inputs, and transistors are placed appropriately in each row of the ROM to generate
the  correct  output  word. 

The  simplest  and  most  common  type  of  column  decoder  is  the  tree  decoder,  shown 
in  Figure  3.3.  This  is  simply  a  tree  of  nMOS  transistors  used  as  pass  gates  to  select 
from one of $2^c$ columns, where c is the number of column address bits. There must
be  one  of  these  trees  for  each  bit  in  the  ROM’s  output  word. 

The  area  of  a  ROM  is  determined  by  the  dimensions  of  the  diagram  in  Figure  3.1. 
The  width  of  the  entire  ROM  is  approximately  equal  to  the  width  of  the  row  decoder 
plus  the  width  of  the  main  memory  portion.  The  height  is  approximately  equal  to 
the  height  of  the  main  memory  plus  the  height  of  the  column  decoder.  The  inverters 
for both the row and column address bits should fit within the space in the lower left
corner.

Figure 3.3: Column tree decoder.

The  sizes  of  the  row  decoder  and  memory  cells  can  be  found  by  studying  the  layout 
of  a  NOR  array,  shown  in  Figure  3.4.  This  figure  shows  the  most  compact  layout  of  the 
array  on  a  lambda  scale  for  a  set  of  MOSIS  scalable  design  rules.  We  will  temporarily 
use  R  to  represent  the  number  of  rows  and  C  the  number  of  columns  in  the  NOR 
array. There must be 9λ between each column, as shown, plus approximately 2λ
on the left side for spacing from the row decoder or other devices. The resulting
expression for the width of the NOR array is

$$W_{NOR} = (9C + 2)\lambda. \tag{3.4}$$

The height of the array depends on the number of rows, R. Figure 3.4 shows that there
is 16λ for each pair of adjacent rows. Additionally, there is 19λ on top for precharge
or loading devices and 8λ on bottom for separation from the column decoder. The
final expression for the height of the NOR array is

$$H_{NOR} = (16R/2 + 8 + 19)\lambda = (8R + 27)\lambda. \tag{3.5}$$




Figure 3.4: Layout of NOR array.

An expression for the height of the column decoder is now needed so that the
height  of  the  whole  ROM  can  be  determined.  The  exact  width  of  the  column  decoder 
is  not  important  because  it  is  less  than  the  width  of  the  ROM  core,  as  shown  in 
Figure  3.1.  Figure  3.5  shows  the  layout  of  the  column  decoder.  The  height  of  the 
column  decoder  can  be  expressed  as 

$$H_{cd} = 18c\lambda, \tag{3.6}$$

where,  again,  c  is  the  number  of  address  bits  in  the  column  decoder. 

The  size  of  the  entire  ROM  can  now  be  found.  The  width  is  equal  to  the  height 
of  the  row  decoder  (before  it  is  rotated)  plus  the  width  of  the  ROM  core: 

$$W_{rom} = (8(2r) + 27)\lambda + (9(2^c n_o) + 2)\lambda = (16r + 9(2^c)n_o + 29)\lambda. \tag{3.7}$$

The height of the ROM is equal to the height of the ROM core plus the height of the
column decoder:

$$H_{rom} = (8(2^r) + 27)\lambda + 18c\lambda = (2^{r+3} + 18c + 27)\lambda. \tag{3.8}$$

The total area can now be written as

$$A_{rom} = W_{rom} \times H_{rom} = (16r + 9(2^c)n_o + 29)(2^{r+3} + 18c + 27)\lambda^2. \tag{3.9}$$
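As a quick numerical check of the area expression, the following Python sketch evaluates the width, height, and area in units of λ and λ². The function name `rom_area` and the example split of r = 3, c = 1 for a 4-input, 2-output ROM are illustrative assumptions, not values taken from the text.

```python
def rom_area(r, c, n_o):
    """ROM area in lambda^2 for r row bits, c column bits, n_o output bits."""
    width = 16 * r + 9 * (2 ** c) * n_o + 29   # row decoder plus core width
    height = 2 ** (r + 3) + 18 * c + 27        # core plus column decoder height
    return width * height

if __name__ == "__main__":
    # Assumed example: 4 address bits split as r = 3, c = 1, with 2-bit words.
    print(rom_area(3, 1, 2))
```

Because the height grows as $2^{r+3}$ while the width grows only linearly in r, the choice of split has a visible effect on the total area.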

The  timing  analysis  for  the  ROM  begins  with  the  following  assumptions  for  the 
process  parameters: 


Symbol     Parameter                       Value
λ          length of min transistor        1.5 μm
R_poly     resistance of polysilicon       50 Ω/square
C_ox       capacitance of thin oxide       500 μF/m²
C_d        source or drain capacitance     C_g/3

Figure 3.5: Layout of column decoder.

The gate capacitance for a minimum size transistor of 4.5 μm × 3.0 μm is

$$C_g = 4.5\,\mu\text{m} \times 3.0\,\mu\text{m} \times 500\,\mu\text{F/m}^2 = 6.75\,\text{fF}. \tag{3.10}$$


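The parameters $K_n$ and $K_p$, which are used in the equations that model MOS devices,
are estimated below for a 4.5 μm × 3.0 μm transistor:

$$K_n = \mu_n C_{ox} \frac{\text{width}}{\text{length}} \approx 26\ \mu\text{A/V}^2, \tag{3.11}$$

$$K_p = K_n / 2 \approx 13\ \mu\text{A/V}^2. \tag{3.12}$$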


The access time for the ROM can be expressed as

$$T_{access} = T_{row} + T_{rom} + T_{col}, \tag{3.13}$$

where

$T_{row}$ = delay through row decoder
$T_{rom}$ = delay through ROM cells
$T_{col}$ = delay through column decoder.


The calculations for $T_{row}$ and $T_{rom}$ are similar because both are delays through NOR
arrays. Hodges and Jackson analyze the delay through a NOR array by expressing
it as the sum of the switching time for the row lines and the charging/discharging of
the column lines. Let $t_r$ be the delay through a row in the NOR array and let $t_c$ be
the delay through a column. The delay through a row can be approximated by the
following equation for the 50% output transition for a uniformly distributed RC line
with a step input:

$$t_r = 0.38RC, \tag{3.14}$$

where R is the total resistance and C is the total capacitance of the distributed line.
The delay through a column in the NOR array can be approximated by the time for
the p-type device at the top of a column to charge the column capacitance to the
50% point [4]:

$$t_c = \frac{C \Delta V}{I_{avg}}. \tag{3.15}$$


The  values  of  R  and  C  are  calculated  by  using  the  parameters  above  and  the  geometry 
of  the  layouts  in  Figures  3.4  and  3.5. 

The  delay  through  the  row  decoder,  Trow,  will  now  be  calculated  by  determining 
the  delays  through  the  rows  and  columns  within  its  NOR  array.  There  are  r  bits  in 
the  row  decoder,  and  each  bit  plus  its  inverse  go  into  separate  rows  in  the  NOR  array, 
yielding 2r rows. There are $2^r$ columns, each driving a separate word select line in
the ROM core. For an array of 2r rows by $2^r$ columns, the delay through a row is

$$t_r = 0.38\,(2^r \times \tfrac{9}{2} \times 50\,\Omega)(2^r \times 6.75\,\text{fF}) = 5.77 \times 10^{-4} \times 4^r\ \text{ns}. \tag{3.16}$$

To calculate the column delay, the column capacitance and average charging current
for a column must be found. Each column will have a transistor connecting it to each
of the input bits or its inverse, for a complete row decoder. Thus, each column has r
of the 2r rows connected to it, resulting in a column capacitance of

$$r \times \frac{6.75\,\text{fF}}{3}. \tag{3.17}$$


The worst case delay is when the depletion mode device at the top of a column must
charge the column from 0 to 5 volts. The average current, $I_{avg}$, is the average of
the current at 0 volts and the current at 2.5 volts output. The parameter K for the
depletion device is assumed to be 1/4 that for a normal nMOS device, and the threshold
voltage is assumed to be $V_T = -3$ V. At 0 volts, the depletion device has 5 volts across
it, but $V_G - V_T = 5\,\text{V} - (-3\,\text{V}) = 8\,\text{V}$, so it is in its linear region. Thus, the current at
the beginning of the charging is

$$I_{0V} = \frac{K_n}{4}\left((5\,\text{V} - (-3\,\text{V}))\,5\,\text{V} - \frac{(5\,\text{V})^2}{2}\right) = 179\ \mu\text{A}. \tag{3.18}$$

At an output level of 2.5 volts, the transistor is still in the linear region, so the current
at this point is

$$I_{2.5V} = \frac{K_n}{4}\left((5\,\text{V} - (-3\,\text{V}))\,2.5\,\text{V} - \frac{(2.5\,\text{V})^2}{2}\right) = 110\ \mu\text{A}. \tag{3.19}$$

The delay through the columns in the row decoder is now

$$t_c = \frac{r \times \frac{6.75\,\text{fF}}{3} \times 2.5\,\text{V}}{\frac{179\,\mu\text{A} + 110\,\mu\text{A}}{2}} = 0.0389 \times r\ \text{ns}. \tag{3.20}$$

The total delay through the row decoder is now

$$T_{row} = (5.77 \times 10^{-4} \times 4^r + 0.0389r)\ \text{ns}. \tag{3.21}$$


The  delay  through  the  ROM  memory  cells  can  be  calculated  in  a  similar  manner. 
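The size of the array is now $2^r$ rows by $n_o 2^c$ columns. The delay through a row is
given by

$$t_r = 0.38\,(n_o 2^c \times \tfrac{9}{2} \times 50\,\Omega)(n_o 2^c \times 6.75\,\text{fF}) = 5.77 \times 10^{-4}\, n_o^2\, 4^c\ \text{ns}. \tag{3.22}$$

The delay through a column is given by

$$t_c = \frac{2^r \times \frac{6.75\,\text{fF}}{3} \times 2.5\,\text{V}}{\frac{179\,\mu\text{A} + 110\,\mu\text{A}}{2}} = 0.0389 \times 2^r\ \text{ns}. \tag{3.23}$$

The total delay through the ROM cells is

$$T_{rom} = (5.77 \times 10^{-4}\, n_o^2\, 4^c + 0.0389 \times 2^r)\ \text{ns}. \tag{3.24}$$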

The delay through the column decoder can be approximated by using Equation 3.15
and assuming that the source and drain capacitances must be discharged by
an nMOS transistor whose length is c times the length of each individual transistor.
The total capacitance is

$$C = (1 + 2c)\,(4.5\,\text{fF}). \tag{3.25}$$

Since the length to width ratio of the effective transistor is c times normal, the currents
before and halfway through discharging are

$$I_{5V} = \frac{1}{2c} K_n (5\,\text{V} - 0.6\,\text{V})^2 = \frac{251.7}{c}\ \mu\text{A} \quad \text{(saturation)}, \tag{3.26}$$

$$I_{2.5V} = \frac{1}{c} K_n \left((5\,\text{V} - 0.6\,\text{V})\,2.5\,\text{V} - \frac{(2.5\,\text{V})^2}{2}\right) = \frac{204.8}{c}\ \mu\text{A}. \tag{3.27}$$

The average current is

$$I_{avg} = \frac{1}{c} \cdot \frac{251.7 + 204.8}{2}\ \mu\text{A} = \frac{228}{c}\ \mu\text{A}. \tag{3.28}$$

The delay through the column decoder is therefore

$$T_{col} = \frac{(1 + 2c)\,\frac{6.75\,\text{fF}}{3}\,(2.5\,\text{V})(2c)}{228\ \mu\text{A}} = 0.0493\,c(1 + 2c)\ \text{ns}. \tag{3.29}$$
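The three delay terms can be combined in a short Python sketch. It only transcribes the closed-form results for the row decoder, the ROM cells, and the column decoder; the function name `rom_access_time_ns` and the example split r = 3, c = 1 are assumptions made for illustration.

```python
def rom_access_time_ns(r, c, n_o):
    """Approximate ROM access time in ns for r row bits, c column bits, n_o output bits."""
    t_row = 5.77e-4 * 4 ** r + 0.0389 * r                  # row decoder delay
    t_rom = 5.77e-4 * n_o ** 2 * 4 ** c + 0.0389 * 2 ** r  # ROM cell array delay
    t_col = 0.0493 * c * (1 + 2 * c)                       # column tree decoder delay
    return t_row + t_rom + t_col

if __name__ == "__main__":
    # Assumed example: 4-input, 2-output ROM split as r = 3, c = 1.
    print(rom_access_time_ns(3, 1, 2))
```

For small ROMs the column-charging terms dominate; the $4^r$ and $4^c$ row-line terms only become significant as the array grows.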


The  above  results  for  area  and  time  are  summarized  in  Figures  3.6  and  3.7  respectively. 
These plots show the area and time as a function of $n_o$, the size of an output word.
The three cases when $n_i$ is either $n_o$, $n_o + 1$, or $2n_o$ will later be shown to be of
primary  interest  for  the  ROMs  used  in  RNS.  The  three  curves,  therefore,  represent 
these  three  cases.  The  layout  of  a  full  adder  is  now  analyzed  so  that  a  comparison 
can  be  made  between  it  and  a  ROM. 


Figure 3.6: Size of ROM (λ²) versus bits in ROM.

Layout of a Full Adder

The schematic and layout of a typical full adder as given by Weste and Eshraghian is
shown in Figure 3.8 [21]. This adder makes heavy use of transmission gates. Although
other full adder designs are possible, they require the same number of transistors (24)
and approximately the same area. The size of this adder, using the same 3 μm design
rules as for the ROMs, is 76λ × 101λ = 7,676λ².

The  delay  through  the  full  adder  can  be  estimated  using  Equation  3.15.  The 
critical  delay  path  is  from  carry-in  to  carry-out.  Figure  3.8  shows  that  the  signal 
must  propagate  through  three  gates:  carry-in  passes  through  an  inverter,  the  inverted 
signal passes through a pass gate controlled by the inputs A and B, and the result
passes through another inverter. The worst case is when carry-in changes from 0 to
1, because then the output inverter will also change from 0 to 1. It will be shown
that the second inverter has a larger capacitive load than the first, so that the worst
case occurs when the second must charge the load from 0 V to 5 V through its p-type
device. 

The capacitive load on the first inverter comes from the 4 source/drains of the
pass gates at that node and from the length of the polysilicon line. The capacitance
of this line is

$$30\lambda \times 1.5\,\frac{\mu\text{m}}{\lambda} \times 0.2\,\frac{\text{fF}}{\mu\text{m}} = 9\ \text{fF}.$$

The total capacitance is now

$$C = 4\,\frac{6.75\,\text{fF}}{3} + 9\ \text{fF} = 18\ \text{fF}. \tag{3.30}$$

The average discharge current through an n-type device was previously found to be
228 μA. The delay through the first inverter can now be calculated:

$$T_{inv1} = \frac{18\ \text{fF} \cdot 2.5\ \text{V}}{228\ \mu\text{A}} = 0.197\ \text{ns}. \tag{3.31}$$


The capacitive load on the pass gate is due to the 2 gates of the inverter it drives
and the polysilicon line connecting it to the inverter. The capacitance of this line is

$$16\lambda \times 1.5\,\frac{\mu\text{m}}{\lambda} \times 0.2\,\frac{\text{fF}}{\mu\text{m}} = 4.8\ \text{fF}.$$

The total capacitance is

$$C = 2 \cdot 6.75\ \text{fF} + 4.8\ \text{fF} = 18.3\ \text{fF}. \tag{3.32}$$

A good estimate of the average current through the pass gate is the average current
through an n-type device, 228 μA. The delay through the pass gate is

$$T_{pass} = \frac{18.3\ \text{fF} \cdot 2.5\ \text{V}}{228\ \mu\text{A}} = 0.201\ \text{ns}. \tag{3.33}$$


The capacitive load on the second inverter comes from the polysilicon carry-in
line of the next full adder, the 2 gates of its first inverter, plus the 2 source/drains of
the pass gate also connected to the carry-in line. The capacitance of this line is

$$57\lambda \times 1.5\,\frac{\mu\text{m}}{\lambda} \times 0.2\,\frac{\text{fF}}{\mu\text{m}} = 17.1\ \text{fF}.$$

The total capacitance is now

$$C = 2 \cdot 6.75\ \text{fF} + 2\,\frac{6.75\,\text{fF}}{3} + 17.1\ \text{fF} = 35.1\ \text{fF}. \tag{3.34}$$

The average current through a p-type device is approximately one half that through
an n-type, so the average charging current through this inverter is (228/2) μA = 114 μA.
The delay through the second inverter is

$$T_{inv2} = \frac{35.1\ \text{fF} \cdot 2.5\ \text{V}}{114\ \mu\text{A}} = 0.770\ \text{ns}. \tag{3.35}$$

The total delay from carry-in to carry-out is

$$T_{fa} = 1.168\ \text{ns}. \tag{3.36}$$
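The three stage delays and their sum can be reproduced with a few lines of Python. This is a sketch that simply re-evaluates the capacitances and currents quoted above; the constant and function names are illustrative choices.

```python
C_GATE = 6.75                      # fF, gate capacitance of a minimum transistor
C_SD = C_GATE / 3                  # fF, source or drain capacitance
POLY_FF_PER_LAMBDA = 1.5 * 0.2     # fF per lambda of polysilicon (1.5 um/lambda x 0.2 fF/um)
I_N = 228.0                        # uA, average current through an n-type device
I_P = I_N / 2                      # uA, average current through a p-type device
HALF_SWING = 2.5                   # V, voltage change to the 50% point

def stage_delay_ns(cap_ff, current_ua):
    """Delay C * dV / I; fF * V / uA works out to nanoseconds."""
    return cap_ff * HALF_SWING / current_ua

def full_adder_stage_delays_ns():
    """Delays of the first inverter, the pass gate, and the second inverter."""
    c_inv1 = 4 * C_SD + 30 * POLY_FF_PER_LAMBDA               # 18.0 fF
    c_pass = 2 * C_GATE + 16 * POLY_FF_PER_LAMBDA             # 18.3 fF
    c_inv2 = 2 * C_GATE + 2 * C_SD + 57 * POLY_FF_PER_LAMBDA  # 35.1 fF
    return (stage_delay_ns(c_inv1, I_N),
            stage_delay_ns(c_pass, I_N),
            stage_delay_ns(c_inv2, I_P))

if __name__ == "__main__":
    delays = full_adder_stage_delays_ns()
    print(delays, sum(delays))
```

The second inverter accounts for roughly two thirds of the total, because it drives the largest load through the weaker p-type device.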

The  area  and  speed  of  a  ROM  can  now  be  normalized  to  a  full  adder  by  dividing  the 
results  of  Figures  3.6  and  3.7  by  the  area  and  speed  of  a  full  adder  as  just  calculated. 
These  results  are  shown  in  Figures  3.9  and  3.10.  Figure  3.11  shows  the  area-time 
product  of  a  ROM  relative  to  the  area-time  of  a  full  adder.  The  efficiency  of  ROM 
layouts  is  apparent  by  noting  that  the  area-time  product  of  a  4-input,  2-output  ROM 
is slightly less than the area-time of a full adder.

3.2 RNS Computational Elements


Before  delving  into  a  performance  analysis  for  the  specific  problem  of  performing 
DFTs,  an  analysis  of  the  individual  computational  hardware  elements  used  in  this 
problem  will  be  done.  For  DFTs,  these  elements  are  adders  and  multipliers. 

3.2.1  RNS  Adders 

As mentioned in Chapter 1, there are at least three ways to implement modular
addition  with  digital  hardware.  The  first  is  to  concatenate  the  binary  representation 
of  the  residues  and  use  the  result  as  an  address  to  a  ROM  look-up  table  which  contains 
the  sum,  modulo  the  proper  integer,  as  shown  in  Figure  3.12.  This  method  can  be 
used  for  multiplication  also,  with  the  result  that  these  two  operations  will  take  the 
same amount of time. For an n-bit modulus, $2^{n-1} < m_i \le 2^n$, the ROM will have $2^{2n}$
words of length n bits.

Figure 3.10: Speed of ROM relative to a full adder, versus bits in ROM.

Soderstrand  presents  two  other  methods  of  performing  modular  addition  [11]. 
The  basic  procedure  is  to  add  the  two  numbers  like  normal  binary  numbers  and  then 
to  correct  the  result  by  subtracting  the  modulus  if  the  result  exceeds  the  range  of 
numbers  allowed  by  that  modulus.  There  are  two  ways  to  do  the  correction.  One 
method uses a ROM to correct the (n+1)-bit sum to an n-bit sum within the proper
range, as shown in Figure 3.13. While still requiring a ROM, the size of the ROM
has been decreased from $2^{2n}$ to $2^{n+1}$ words. Another method is to use a second n-bit
adder to add the value $2^n - m_i$ to the output of the first adder. If either adder
produces a carry, the corrected result is used; otherwise, the output of the first adder
is used. This scheme can be derived by starting with a two's complement system
that subtracts $m_i$ if the result of the first addition is greater than or equal to $m_i$. Let
the addends be x and y such that $0 \le x < m_i$, $0 \le y < m_i$. In two's complement
notation these each require n + 1 bits. The first adder computes $z' = x + y$, which
requires n + 2 bits. The next adder computes $z'' = z' - m_i$, which must fall in the
range $-m_i \le z'' \le m_i - 1$, an (n+1)-bit two's complement number. If $z''$ is negative,
then $z = z'$; otherwise, $z = z''$. This procedure can be simplified by noting that since
negative integers are not used for x, y, and z, only n bits are needed. Furthermore,
the most significant bit in $z'$ and $z''$ can be handled by separate logic to select from
the two possible choices for z. The negative of $m_i$ in an (n+1)-bit two's complement
system has the same representation as $2^n - m_i$ for the first n bits of interest; this
is why the procedure can be interpreted as adding $2^n - m_i$ if the result of the first
addition is greater than $m_i - 1$. A block diagram of this type of RNS adder is shown
in Figure 3.14. The adder can be further simplified by noting that if $m_i$ is fixed, then
the second adder can be customized to reduce its size. The second layer of full adders
can be replaced by the equivalent of half adders since one addend is always known.

Figure 3.13: Modulo $m_i$ adder using correction ROM.
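The two-adder scheme can be modeled bit-accurately in Python. The sketch below is a behavioral model under the assumptions stated in the text (residues in the range 0 to m - 1 and an n-bit modulus with $2^{n-1} < m \le 2^n$); the function name `dual_adder_mod_add` is hypothetical.

```python
def dual_adder_mod_add(x, y, m, n):
    """Add residues x and y modulo m using two n-bit binary additions."""
    mask = (1 << n) - 1
    first = x + y                    # first n-bit adder
    carry1 = first >> n              # carry out of the first adder
    s = first & mask                 # n-bit sum of the first adder
    second = s + ((1 << n) - m)      # second adder adds the constant 2**n - m
    carry2 = second >> n             # carry out of the second adder
    t = second & mask                # corrected n-bit sum
    # The OR of the two carries drives the multiplexors that pick the result.
    return t if (carry1 or carry2) else s

if __name__ == "__main__":
    m, n = 13, 4
    print(dual_adder_mod_add(9, 8, m, n), (9 + 8) % m)
```

A carry from the first adder means x + y has already overflowed n bits, and a carry from the second means x + y is at least m; in both cases the corrected sum is the right residue.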

An  RNS  adder  constructed  like  Figure  3.14  requires  two  n-bit  adders,  an  OR  gate, 
and n 1-bit multiplexors. The area of this adder is given by

$$A_{RNSadd} = 2nA_{fa} + A_{or} + nA_{mux}, \tag{3.37}$$

where

$A_{fa}$ = area of a full adder
$A_{or}$ = area of an OR gate
$A_{mux}$ = area of a multiplexor.

Figure 3.14: Modulo $m_i$ adder using two binary adders.

Assuming $A_{fa} \gg A_{or}, A_{mux}$, the size of the adder can be approximated by

$$A_{RNSadd} \approx 2nA_{fa}. \tag{3.38}$$

The time required to perform the addition is given by

$$T_{RNSadd} = (n + 1)T_{fa} + T_{or} + T_{mux}, \tag{3.39}$$

where

$T_{fa}$ = delay through a full adder
$T_{or}$ = delay through an OR gate
$T_{mux}$ = delay through a multiplexor.

Because data ripples through the two layers of full adders diagonally, not downward
through each layer, the delay through the full adder section is equal to the number
of full adders across plus the number of layers after the first one. Now, assuming
$T_{fa} \gg T_{or}, T_{mux}$, the delay through the adder can be approximated by

$$T_{RNSadd} \approx (n + 1)T_{fa}. \tag{3.40}$$

The area-time product for a modulo-$m_i$ adder, where $2^{n-1} < m_i \le 2^n$, can now be
approximated by

$$AT_{RNSadd} = 2n(n + 1)A_{fa}T_{fa}. \tag{3.41}$$

Figure 3.15: Area-time products for three modular adders, versus number of bits.

The area-time products for the three types of modular adders are plotted together
in Figure 3.15. Each of these adders has a region in which it is superior to the other
two. The adders using ROMs are better than the one that uses dual adders for moduli
less than $2^3$. Moduli larger than $2^3$ require at least 4 bits, in which case the adder
using dual n-bit adders dominates. Since the dual adder type dominates for moduli
of at least 4 bits and is almost equal to the other types of adders for 3-bit moduli, it
will be the one used for comparisons against integer adders.


3.2.2  RNS  Multipliers 

There  are  several  methods  of  performing  RNS  multiplication  that  have  been  discussed 
in  the  literature,  some  of  which  were  explained  in  Chapter  1.  The  first  and  simplest 
method  is  to  use  a  ROM  look-up  table;  the  structure  would  be  identical  to  the  ROM 
adder  of  Figure  3.12  except  that  it  would  be  programmed  for  multiplication  instead  of 
addition.  This  type  of  multiplier  has  the  advantage  of  running  as  fast  as  the  ROM’s 
access  time,  but  the  ROMs  take  up  a  large  amount  of  area  for  larger  moduli.  If  the 
modulus has n bits, the address of the ROM will contain 2n bits. Moduli of 5 bits
(no larger than 32) require a ROM of 1K × 5 bits. Each additional bit in the moduli
quadruples  the  size  of  the  required  ROM. 

To  alleviate  this  problem,  two  other  types  of  multipliers  have  been  proposed. 
The  first,  discussed  by  Jullien,  works  for  prime  moduli  m,  and  uses  index  calculus 
as shown in Figure 3.16. A one-to-one “logarithmic” mapping is defined between
$\{g_n\} = \{1, \ldots, (m_i - 1)\}$ and $\{k_n\} = \{0, \ldots, (m_i - 2)\}$ via a primitive root $\alpha$ such that

$$g_n = |\alpha^{k_n}|_{m_i}. \tag{3.42}$$

Multiplication of the residues $g_n$ corresponds to addition of the exponents $k_n$ modulo
$m_i - 1$:

$$|g_n g_j|_{m_i} = \left|\alpha^{|k_n + k_j|_{m_i - 1}}\right|_{m_i}.$$

Multiplication is done by looking up the two indices of the multiplicands in ROMs,
adding them modulo $m_i - 1$, and then using a ROM to get the product from the
resulting index [8]. Additional circuitry is needed to detect a zero since it has no
index. This procedure is analogous to performing multiplication using logarithms.
The hardware requirements are three ROMs of $2^n \times n$ bits and an n-bit modular
adder; the time required for the process is equal to twice the delay through one ROM
plus the delay through one adder.

Figure 3.16: Modulo $m_i$ multiplier using index calculus.

Figure 3.17: Modulo $m_i$ multiplier using quarter squares identity.
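A behavioral sketch of index-calculus multiplication in Python may help make the table structure concrete. The function names are hypothetical, the primitive root is found by brute force (adequate only for the small prime moduli considered here), and Python lists and dictionaries stand in for the ROMs.

```python
def primitive_root(p):
    """Smallest primitive root of an odd prime p, found by brute force."""
    for a in range(2, p):
        if len({pow(a, k, p) for k in range(p - 1)}) == p - 1:
            return a
    raise ValueError("no primitive root found; p must be an odd prime")

def build_index_tables(p):
    """Return (log, antilog): residue -> index and index -> residue tables."""
    alpha = primitive_root(p)
    antilog = [pow(alpha, k, p) for k in range(p - 1)]
    log = {g: k for k, g in enumerate(antilog)}
    return log, antilog

def index_multiply(x, y, p, log, antilog):
    """Multiply residues modulo prime p by adding their indices modulo p - 1."""
    if x == 0 or y == 0:
        return 0                       # zero has no index and is detected separately
    k = (log[x] + log[y]) % (p - 1)    # modulo (p - 1) adder
    return antilog[k]                  # final ROM maps the index back to a residue

if __name__ == "__main__":
    log, antilog = build_index_tables(13)
    print(index_multiply(7, 9, 13, log, antilog), (7 * 9) % 13)
```

The two `log` look-ups correspond to the two input ROMs and the `antilog` look-up to the output ROM, which matches the three-ROM count given above.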

Another modular multiplier which reduces ROM sizes is the quarter squares multiplier
presented by Soderstrand and Vernia and shown in Figure 3.17; it takes advantage
of the quarter squares identity mentioned in Chapter 1 [13]:

$$xy = \frac{(x + y)^2 - (x - y)^2}{4}. \tag{3.43}$$

Because the multiplicative inverse of 4 does not exist for even moduli, this exact
procedure cannot be used for even moduli. However, Taylor has shown how this
multiplier can be modified for even moduli [15]. This usually would not pose a
problem anyway because the best choice for an even modulus among a set of moduli
would be $2^n$, so that conventional binary circuitry could be used and carries beyond
n bits disregarded. The hardware cost of the quarter squares approach is three n-bit
modular adders and two ROMs of $2^n \times n$ bits; the time required for the calculation
is twice the time for a modular adder plus the delay through a ROM. The quarter
squares multiplier takes approximately the same amount of time as the index calculus
multiplier but reduces the size by one ROM in exchange for two more n-bit adders.
Also, its use is not restricted to prime moduli.

Figure 3.18: Modified modulo $m_i$ multiplier using quarter squares identity.
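The quarter squares approach can be sketched in Python for an odd modulus. This is a behavioral model only: the function names are hypothetical, a list stands in for the two identical ROMs, and the table entry for s is taken to be s² times the modular inverse of 4, which is one natural reading of "divide by 4" in modular arithmetic.

```python
def build_quarter_square_rom(m):
    """ROM contents: s -> |s^2 * 4^(-1)|_m for an odd modulus m."""
    if m % 2 == 0:
        raise ValueError("4 has no multiplicative inverse for an even modulus")
    inv4 = pow(4, -1, m)               # modular inverse of 4 (Python 3.8+)
    return [(s * s * inv4) % m for s in range(m)]

def quarter_square_multiply(x, y, m, rom):
    """Multiply residues x and y modulo m via the quarter squares identity."""
    s = (x + y) % m                    # first modular adder
    d = (x - y) % m                    # second modular adder (subtraction)
    return (rom[s] - rom[d]) % m       # two ROM look-ups, third modular adder

if __name__ == "__main__":
    rom = build_quarter_square_rom(15)
    print(quarter_square_multiply(7, 11, 15, rom), (7 * 11) % 15)
```

The example uses the composite modulus 15 to illustrate that, unlike the index calculus multiplier, the modulus need not be prime, only odd.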

Alternatively,  the  quarter  squares  multiplier  could  be  implemented  as  shown 
in Figure 3.18. The initial addition and subtraction are performed with single n-bit
adders and correction ROMs, where the ROMs are programmed to correct the
sum/difference,  square  it,  and  divide  by  4.  This  method  has  a  possible  advantage 
because  it  eliminates  two  adders  in  exchange  for  an  additional  input  bit  in  each  of 
the  two  ROMs. 

A  fourth  way  of  doing  modular  multiplication  is  to  multiply  each  bit  of  the  multi¬ 
plier  by  the  multiplicand  and  then  add  the  partial  products  with  their  proper  weight¬ 
ings  by  powers  of  2.  This  is  the  same  algorithm  used  in  the  most  simple  integer 
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Figure  3.19:  Modulo  m,  multiplier  using  modular  adders. 

multipliers.  A  diagram  of  this  adder  is  shown  in  Figure  3.19.  The  size  of  this  multi¬ 
plier  is  given  by 

$$A_{FA\text{-}mult} = 3(n-1)nA_{fa} + 2(n-1)A_{or} + 2(n-1)nA_{mux} + nA_{and} \approx 3(n-1)nA_{fa}. \tag{3.44}$$

Since all but the first multiplication of x by 2 are done simultaneously with the additions,
the time required to complete the operation is

$$T_{FA\text{-}mult} = nT_{fa} + (n-1)\left((n+1)T_{fa} + T_{or} + T_{mux}\right) \approx (n^2 + n - 1)T_{fa}. \tag{3.45}$$
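
The bit-by-bit procedure can be sketched in Python as follows. This is a behavioral model only, assuming residues in [0, m) and an n-bit modulus; the helper names are hypothetical, and the loop is sequential whereas the hardware overlaps the doublings with the additions.

```python
def mod_add(a, b, m):
    """Modular adder: a + b reduced by one conditional subtraction."""
    s = a + b
    return s - m if s >= m else s


def shift_add_mod_multiply(x, y, m, n):
    """Compute (x * y) mod m from n partial products, x and y in [0, m)."""
    acc = 0
    term = x                            # holds (x * 2^i) mod m
    for i in range(n):
        if (y >> i) & 1:                # bit i of the multiplier gates the term
            acc = mod_add(acc, term, m)
        term = mod_add(term, term, m)   # modular doubling for the next weight
    return acc


print(shift_add_mod_multiply(11, 7, 13, 4))  # 77 mod 13 = 12
```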

The area-time products for these four multipliers are plotted together in Figure 3.20.
The simple ROM look-up table is best for moduli of 4 bits or less, but
grows exponentially worse for larger moduli. The index multiplier appears to be the
best for moduli of 5-8 bits, but it has the disadvantage of only working for prime
moduli. The next best multiplier for moduli of 5-8 bits is the quarter squares multiplier.
The quarter squares multiplier and modified quarter squares multiplier are
close, but the modified approach is better in the 2-6 bit range, which is the more
common size of moduli. For 4-bit moduli, the modified quarter squares multiplier is
close in performance to the ROM look-up; smaller moduli of 2 or 3 bits are seldom
used, and if they were used in parallel with larger moduli, they would have a relatively
insignificant contribution to the total area-time product. For this reason, the modified
quarter squares multiplier will be considered the preferred modular multiplier
and will be used in comparing RNS to conventional arithmetic.

Figure 3.20: Area-time products for four modular multipliers (vertical axis: area-time relative to full adder).

Figure 3.21: Integer multiplier.

3.3  Conventional  Computational  Elements 

3.3.1  Adders 

A  conventional  n-bit  binary  integer  adder  would  require  n  full  adders  and  have  a 
delay of n times the delay of each full adder, for an area-time product of

$$n^2 A_{fa} T_{fa}, \tag{3.46}$$

where $A_{fa}$ and $T_{fa}$ are as defined above.

3.3.2  Multipliers 

A  simple  integer  multiplier  is  made  of  an  array  of  full  adders  as  shown  in  Figure  3.21. 
Each row of adders adds another partial product into the result, with carries propagating
diagonally so that they are "saved" and added in at the following row. Not
all of the adders in Figure 3.21 are actually full adders. The first row and the first
adder in the last row are half adders. The total size of the multiplier is given by

$$A_{mult} = n(n-2)A_{fa} + n^2 A_{and}. \tag{3.47}$$

The time to compute the product is dominated by the delay through the adders. Since
carries are saved, the delay is equal to the number of layers plus the time required to
add in all the carries after the final partial product is computed:

$$T_{mult} = 2nT_{fa}. \tag{3.48}$$

3.4  Side-By-Side  Comparison 

In a real RNS application, several modular adders would be working in parallel to
replace a larger integer adder. To compare the area-time products for these two
alternatives, the dynamic range for the integer adder must be broken into a product
of smaller ranges for each of the moduli.

Let b be the number of bits in the integer so that it has a dynamic range of
$2^b$. If l moduli are used to perform the same addition with l modular adders, the
ideal size of each modulus would be $2^{b/l}$, or b/l bits. Of course, this is not really
possible because the l moduli must be relatively prime integers, which is generally
not the case if this simple formula is used; however, this formula provides a simple
expression for determining the best possible moduli given l. The following three
examples demonstrate the problem of picking the moduli. For the first, let b = 8 and
l = 3. We desire 3 moduli with approximately 8/3 ≈ 2.67 bits per modulus. Since
moduli must be integers, the largest modulus must have at least 3 bits. In this case,
the moduli set {8, 7, 5} will work. The dynamic range is 8 · 7 · 5 = 280. The total
number of bits required for the moduli is 3 + 3 + 3 = 9, 1 more bit than for the integer
equivalent. For the second example, let b = 16 and l = 3. The largest modulus will
have at least ⌈16/3⌉ = 6 bits, so the set {64, 63, 17} is chosen. The dynamic range is 68,544. This
time, 6 + 6 + 5 = 17 bits are needed to encode 16-bit integers. The last example uses
the same dynamic range as the previous example, except that five moduli will be used
with the intention of increasing computational speed by using smaller moduli. We
need approximately 16/5 = 3.2 bits per modulus. The set {16, 13, 11, 7, 5} will work in
this case, with a dynamic range of 80,080. Now, however, the total number of bits
required is 4 + 4 + 4 + 3 + 3 = 18, resulting in 2 extra bits.
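
The bookkeeping in these examples (pairwise relative primality, dynamic range, and total residue bits) can be checked with a short Python sketch. The function name and the returned field names are hypothetical, and the five-modulus set in the usage lines is simply one valid choice used for illustration.

```python
from itertools import combinations
from math import gcd


def describe_moduli(moduli, b):
    """Summarize an RNS moduli set against a b-bit integer dynamic range."""
    coprime = all(gcd(p, q) == 1 for p, q in combinations(moduli, 2))
    dynamic_range = 1
    for m in moduli:
        dynamic_range *= m
    # A modulus m has residues 0..m-1, which need (m-1).bit_length() bits.
    bits = sum((m - 1).bit_length() for m in moduli)
    return {
        "pairwise_coprime": coprime,
        "dynamic_range": dynamic_range,
        "covers_2_to_b": coprime and dynamic_range >= 2 ** b,
        "bits": bits,
        "extra_bits": bits - b,
    }


print(describe_moduli([8, 7, 5], 8))      # range 280, 9 bits, 1 extra
print(describe_moduli([64, 63, 17], 16))  # range 68544, 17 bits, 1 extra
print(describe_moduli([16, 13, 11, 7, 5], 16))  # range 80080, 18 bits, 2 extra
```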


3.4.1  Adders 


With b/l bits per modulus, the RNS adder will have l modular adders, each of which
has an area-time product of $2\frac{b}{l}\left(\frac{b}{l}+1\right)A_{fa}T_{fa}$. The larger integer adder has an
area-time product of $b^2 A_{fa}T_{fa}$, which means the ratio of the RNS area-time to the integer
area-time is

$$\frac{2(b+l)}{bl}. \tag{3.49}$$

RNS will have a performance advantage if this ratio is less than 1, which holds if

$$l > \frac{2b}{b-2}. \tag{3.50}$$

In order for this inequality to hold for even the largest dynamic ranges, there must be
at least 3 moduli. The following section will develop results showing the performance
advantage in using RNS for multiplication.
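
As a quick numerical check of Equations 3.49 and 3.50, the ratio and the smallest qualifying number of moduli can be evaluated with exact fractions. This is a sketch under the idealized assumption of exactly b/l bits per modulus; the function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def rns_adder_ratio(b, l):
    """RNS-to-integer adder area-time ratio, 2(b + l) / (b l)."""
    return Fraction(2 * (b + l), b * l)


def min_moduli_for_adder_advantage(b):
    """Smallest integer l satisfying l > 2b / (b - 2); requires b > 2."""
    bound = Fraction(2 * b, b - 2)
    return floor(bound) + 1   # strict inequality, so step past the bound


for b in (8, 16, 32):
    l = min_moduli_for_adder_advantage(b)
    print(b, l, float(rns_adder_ratio(b, l)))
```

Because 2b/(b-2) exceeds 2 for every b and approaches 2 as b grows, the minimum never falls below 3 moduli.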


3.4.2  Multipliers 

The RNS multiplier used for comparison is the modified quarter squares multiplier.
There will be a multiplier for each of the l moduli of b/l bits, for a total area of l
times the area of a (b/l)-bit multiplier and the speed of one (b/l)-bit multiplier.

Because the area-time product for the RNS multiplier is much more complicated
than that for the adder, a simple expression illustrating its performance advantage
over conventional arithmetic is not possible. Instead, Figure 3.22 shows the ratio
of the RNS multiplier's area-time product to the conventional integer multiplier's area-time
product. Each curve plots the ratio for a different number of moduli. This ratio can
be as low as 1/20 for large multipliers with many moduli.

Figure 3.22: RNS/Integer multiplier area-time ratio for different numbers of moduli.

3.4.3  Registers 

Registers are not computational components, but because they are used so commonly
to store data temporarily and to pipeline systems both within and between components,
their implementation should be examined. For the most part, registers will
be almost identical for integer and RNS systems that have the same dynamic range.
The real difference is that RNS will require one or two more bits in each register to
match the dynamic range of an integer system. This is apparent if one looks back
to Section 3.4 in which it was stated that a dynamic range of $2^b$ cannot truly be
factored into l moduli all of size $2^{b/l}$. It is also impossible to make the moduli set be
$\{2^{b_1}, 2^{b_2}, \ldots, 2^{b_l}\}$ such that $\sum_{i=1}^{l} b_i = b$ because only one of the moduli may be of the
form $2^{b_i}$ if they are to be relatively prime. The other moduli must be odd numbers
with no common factors. Thus, the number of bits in the actual RNS dynamic range is less than the sum
of the bits required to represent all of the residues, and larger moduli with more bits
are needed to realize the b-bit dynamic range of the integer system. In practice, this
usually amounts to one or two more bits.

Chapter 4


DFT  Algorithms 


The problem of computing DFTs, for which the application of RNS arithmetic is
being analyzed, will now be discussed. As explained in Chapter 1 there are several
ways of computing Discrete Fourier Transforms (DFTs), all of which use some
divide-and-conquer technique to break the problem into smaller pieces. The DFT is defined
in Oppenheim and Schafer [10] as

$$X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}, \tag{4.1}$$

where $W_N = e^{-j2\pi/N}$. A direct computation of equation 4.1 requires N multiplies and
N - 1 additions for each of the N output points, for a total operation count proportional
to $N^2$. Although the equation can be manipulated for real sequences to
eliminate trivial operations, the operations are generally complex, requiring several
computations on the real and imaginary parts of the operands. The algorithms discussed
in this chapter all attempt to reduce the number of operations required to
compute the DFT by taking advantage of the periodic nature of equation 4.1. It will
be shown that the Winograd Fourier Transform Algorithm is the most efficient for
computing DFTs in the residue number system, and that RNS arithmetic has the
greatest comparative advantage, if any, over radix-2 arithmetic for this algorithm.

4.1  Cooley-Tukey  FFT


Burrus and Parks [3] explain the derivation of the many DFT algorithms, including
the popular Cooley-Tukey Fast Fourier Transform (FFT). The length of an FFT
should ideally be a highly composite number, with the most common FFTs having a
length of the form $b^m$, where b is usually 2. The FFT breaks the transform into smaller
pieces by factoring the length N into its factors. Each layer of an FFT removes one
more factor from the original length. If the length N is factored into $N_1$ and $N_2$, then
the following mapping is made for the time and frequency indices:

$$n = N_2 n_1 + n_2 \tag{4.2}$$

$$k = k_1 + N_1 k_2, \tag{4.3}$$

such that

$$x[n_1, n_2] = x[n] \tag{4.4}$$

$$X[k_1, k_2] = X[k]. \tag{4.5}$$

The DFT of Equation 4.1 can now be written as

$$X[k_1, k_2] = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x[n_1, n_2] W_{N_1}^{n_1 k_1} W_N^{n_2 k_1} W_{N_2}^{n_2 k_2}. \tag{4.6}$$

Examination of Equation 4.6 reveals that first $N_1$-point DFTs are performed for
the $N_2$ values of $n_2$ in the inner summation. Then, each of the resulting points is
multiplied by a twiddle factor, the term $W_N^{n_2 k_1}$. The outer summation is an $N_2$-point
DFT which must be computed for each of the $N_1$ values of $k_1$. Since FFTs are usually
performed on sequences whose lengths are a power of 2, $N_1$ is usually 2 and $N_2 = N/2$.
The $N_2$-length DFTs are performed by doing another mapping and removing another
factor of 2. Thus, all DFTs are reduced to 2-point DFTs, called butterflies, which
are simply the sum and difference of two data points [3].

The resulting flowgraph for an 8-point FFT is shown in Figure 4.1. Because of the
index mapping, the output points are in bit-reversed order. This algorithm is called
the decimation-in-frequency FFT. If $N_2$ were chosen to be 2 so that $N_1 = N/2$, the
flowgraph would have evolved into the decimation-in-time FFT, in which the input
points are in bit-reversed order.

Figure 4.1: Flowgraph of an 8-point FFT [10], p. 602.
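
The index mapping can be checked numerically. The Python sketch below assumes the mapping $n = N_2 n_1 + n_2$, $k = k_1 + N_1 k_2$ and evaluates one level of the decomposition (inner $N_1$-point DFTs, twiddle factors, outer $N_2$-point DFTs) against a direct DFT. It is a verification aid rather than an efficient FFT, and the function names are hypothetical.

```python
import cmath


def W(N, e):
    """The complex exponential W_N^e = exp(-j 2 pi e / N)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def dft(x):
    """Direct evaluation of the DFT, O(N^2) operations."""
    N = len(x)
    return [sum(x[n] * W(N, n * k) for n in range(N)) for k in range(N)]


def cooley_tukey_two_factor(x, N1, N2):
    """One level of the Cooley-Tukey decomposition for N = N1 * N2."""
    N = N1 * N2
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            total = 0j
            for n2 in range(N2):
                # Inner N1-point DFT over n1 for this value of n2.
                inner = sum(x[N2 * n1 + n2] * W(N1, n1 * k1) for n1 in range(N1))
                # Twiddle factor, then one term of the outer N2-point DFT.
                total += inner * W(N, n2 * k1) * W(N2, n2 * k2)
            X[k1 + N1 * k2] = total
    return X


x = [complex(n, -0.5 * n) for n in range(8)]
error = max(abs(a - b) for a, b in zip(dft(x), cooley_tukey_two_factor(x, 2, 4)))
print(error < 1e-9)
```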

The Cooley-Tukey FFT greatly reduces the number of operations required to compute
the DFT from the number required in straightforward evaluation of Equation 4.1.
If the length of the DFT is $N = 2^m$, then there will be m layers in the FFT flowgraph.
Each layer requires N complex additions (or subtractions) and N/2 complex multiplies,
as seen in Figure 4.1. Since $m = \log_2 N$, the total operation count is $N \log_2 N$
complex additions and $\frac{N}{2} \log_2 N$ complex multiplies. (Actually, the last layer requires
no multiplies, so the number of multiplies is $\frac{N}{2}(\log_2 N - 1) = \frac{N}{2} \log_2 \frac{N}{2}$.
The expression $\frac{N}{2} \log_2 N$ is generally used for the number of multiplies because it
assumes a system that is not designed to detect this irregularity.)
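
These counts can be turned into real-operation totals with a few lines of Python. The conversion is an assumption: each complex multiply is taken as 4 real multiplies and 2 real adds, and each complex add as 2 real adds. Under that assumption the results agree with the FFT columns of Table 4.2. The function name is hypothetical.

```python
def fft_real_op_counts(N, skip_last_layer=False):
    """Return (real multiplies, real adds) for a radix-2 FFT of length N."""
    m = N.bit_length() - 1
    if N < 2 or 2 ** m != N:
        raise ValueError("N must be a power of 2")
    complex_adds = N * m
    layers_with_multiplies = m - 1 if skip_last_layer else m
    complex_mults = (N // 2) * layers_with_multiplies
    real_mults = 4 * complex_mults
    real_adds = 2 * complex_adds + 2 * complex_mults
    return real_mults, real_adds


print(fft_real_op_counts(64))    # (768, 1152)
print(fft_real_op_counts(1024))  # (20480, 30720)
```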

Implementing  the  FFT  with  integer  arithmetic  requires  a  fair  amount  of  scaling 
of  the  data.  Each  layer  has  one  add  and  one  multiply,  which  continually  increase  the 
dynamic  range  required  by  the  system  if  the  numbers  are  not  scaled.  An  addition 
doubles the dynamic range, or adds one bit. The twiddle factors are complex exponentials,
which reduce to sines and cosines. Ideally, these multiplies do not change
the  magnitude  of  the  points;  however,  with  integer  arithmetic,  the  twiddle  factors 
must  be  scaled  up  by  a  factor  of  2  for  each  bit  of  resolution  desired.  Of  course,  in 
two’s  complement  notation  these  extra  bits  can  be  immediately  removed  after  the 
multiplication  by  shifting  the  data  an  equal  number  of  bits  to  the  right,  since  each 
shift  corresponds  to  a  division  by  2.  The  extra  bits  may  be  discarded  (truncated)  or 
used  to  round  the  least  significant  bit. 

Scaling  is  not  so  trivial  with  RNS  arithmetic  because  it  is  not  a  weighted  number 
system,  as  discussed  in  Chapter  2.  It  was  shown  that  the  number  of  operations 
required  to  scale  each  number  is  on  the  order  of  2 L,  where  L  is  the  number  of  moduli. 
Because  of  the  time  required  to  scale  in  the  RNS,  it  should  be  avoided  whenever 
possible.  A  larger  dynamic  range  with  respect  to  the  size  of  the  data  words  will 
reduce  the  amount  of  scaling  that  must  be  done  while  performing  the  transform.  If 
we  let  bd  represent  the  number  of  bits  in  the  data  and  let  bc  represent  the  number  of 
bits  in  the  twiddle  factors,  or  coefficients,  then  the  range  will  grow  by  bc  +  1  in  each 
stage  of  the  FFT.  After  the  first  stage,  the  required  dynamic  range  is  bd  +  bc  +  1  bits. 
The  moduli  chosen  in  an  RNS  implementation  of  the  FFT  must  yield  a  dynamic  range 
at  least  this  large  in  order  to  allow  the  completion  of  one  addition  and  multiplication 
on  a  pair  of  data  points  before  scaling  them  back  to  the  original  range.  If  the  dynamic 
range  were  increased  to  bd  +  2bc  +  2  bits,  then  two  stages  of  the  FFT  may  be  completed 
between  scaling  points. 
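
The relationship between dynamic range and scaling frequency can be written as a small Python helper, assuming growth of exactly $b_c + 1$ bits per stage as described. The function names are hypothetical.

```python
def required_range_bits(b_d, b_c, stages):
    """Bits of dynamic range needed to run `stages` FFT stages unscaled."""
    return b_d + stages * (b_c + 1)


def stages_between_scaling(range_bits, b_d, b_c):
    """Number of FFT stages that fit in the range before scaling is needed."""
    return max(0, (range_bits - b_d) // (b_c + 1))


print(required_range_bits(8, 8, 1))      # 17
print(required_range_bits(8, 8, 2))      # 26
print(stages_between_scaling(26, 8, 8))  # 2
```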


4.2  Prime  Factor  Algorithm 

The  prime  factor  algorithm  (PFA)  for  the  DFT  is  also  discussed  by  Burrus  and  Parks. 
Developed  by  Good,  Thomas,  and  Winograd,  this  algorithm  also  maps  a  sequence 
into  a  multi-dimensional  array  so  that  smaller  DFTs  may  be  performed  along  each  of 
these dimensions. For the PFA, however, the transform length is broken into relatively
prime factors. These factors will be the lengths of the dimensions along which the
smaller DFTs will be performed. The length is usually chosen so that the factors
fall into a set of lengths for which Winograd has developed a set of small-N DFT
algorithms which minimize the number of multiplies. The lengths of these small DFTs are usually
primes or powers of primes, the most common being 3, 5, 7, 9, and powers of 2 up to
16. Other lengths are possible but are not as efficient. Algorithms for longer lengths
become exponentially longer and more difficult to derive [3].

The PFA mappings for n and k involve modular arithmetic. If the length N is
factored into $N_1$ and $N_2$, the mappings, given by Burrus and Parks, are

$$n = \left| N_2 n_1 + N_1 n_2 \right|_N \tag{4.7}$$

$$k = \left| N_2 \left| N_2^{-1} \right|_{N_1} k_1 + N_1 \left| N_1^{-1} \right|_{N_2} k_2 \right|_N. \tag{4.8}$$

Note that $k_i = |k|_{N_i}$, and that equation 4.8 is the Chinese Remainder Theorem.
The mappings for n and k may also be reversed. These mappings load and unload
the points along extended diagonals of a two-dimensional array. The original DFT
equation now becomes [3]

$$X(k_1, k_2) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x(n_1, n_2) W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2}. \tag{4.9}$$

This is a pure two-dimensional DFT; each of the summations is an independent DFT,
the order of summation can be interchanged, and there are no twiddle factors. For
mappings into more than two dimensions, $N_1$ or $N_2$, or both, may be further factored
and the PFA used to break the $N_1$- and $N_2$-point DFTs into smaller pieces. A diagram
of the PFA is shown in Figure 4.2 for a 15-point DFT. The sets of "row" and "column"
DFTs are done using Winograd's algorithms, which consist of a set of additions, a
set of multiplications, followed by another set of additions.
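
The absence of twiddle factors can be confirmed numerically. The sketch below assumes the input map $n = (N_2 n_1 + N_1 n_2) \bmod N$ and a Chinese Remainder Theorem output map, then compares a two-dimensional DFT without twiddle factors against a direct DFT. It is a check of the index maps only; the short DFTs are evaluated directly rather than with Winograd's algorithms, and all function names are hypothetical.

```python
import cmath


def W(N, e):
    """The complex exponential W_N^e = exp(-j 2 pi e / N)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def dft(x):
    N = len(x)
    return [sum(x[n] * W(N, n * k) for n in range(N)) for k in range(N)]


def mod_inverse(a, m):
    """Multiplicative inverse of a modulo m, found by search (m is small)."""
    for t in range(1, m):
        if (a * t) % m == 1:
            return t
    raise ValueError("no inverse: arguments are not relatively prime")


def pfa_two_factor(x, N1, N2):
    """Two-factor prime factor DFT for relatively prime N1, N2."""
    N = N1 * N2
    c1 = N2 * mod_inverse(N2, N1)   # CRT coefficient for k1
    c2 = N1 * mod_inverse(N1, N2)   # CRT coefficient for k2
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            total = 0j
            for n2 in range(N2):
                for n1 in range(N1):
                    n = (N2 * n1 + N1 * n2) % N       # input index map
                    total += x[n] * W(N1, n1 * k1) * W(N2, n2 * k2)
            X[(c1 * k1 + c2 * k2) % N] = total        # CRT output index map
    return X


x = [complex(n, 2 - n) for n in range(15)]
error = max(abs(a - b) for a, b in zip(dft(x), pfa_two_factor(x, 3, 5)))
print(error < 1e-9)
```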

The operation count for the PFA can be found by totalling the number of operations
required in each stage. The number of operations required for each of the
small length-$N_i$ DFTs is shown in Table 4.1. If the transform length is factored into
J dimensions, such that $N_1 \cdot N_2 \cdots N_J = N$, then there will be $N/N_i$ $N_i$-point
DFTs for i = 1, ..., J. If $A_i$ and $M_i$ represent, respectively, the number of adds and
multiplies for the length-$N_i$ DFT, the total numbers of adds and multiplies are given
by

$$A = \sum_{i=1}^{J} \frac{N}{N_i} A_i \tag{4.10}$$

$$M = \sum_{i=1}^{J} \frac{N}{N_i} M_i. \tag{4.11}$$

Figure 4.2: Two-factor prime factor algorithm [3], p. 63.

Table 4.1: Number of real operations required for length-N DFT of real data (double
for complex data) [3].

N     Total Multiplies    Multiplies By One    Adds
2     2                   2                    2
3     3                   1                    6
4     4                   4                    8
5     6                   1                    17
7     9                   1                    36
8     8                   6                    26
9     11                  1                    43
16    18                  8                    74
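
The PFA totals can be evaluated with a short Python sketch. Two assumptions are made here and should be read as such: multiplications by one are excluded from $M_i$, and the real-data counts are doubled for complex data. With those assumptions the results match the PFA columns of Table 4.2 for lengths 120 and 240. Only the short lengths needed for those two examples are included, and the names are hypothetical.

```python
# N_i: (total multiplies, multiplies by one, adds) for real data, from Table 4.1.
SHORT_DFT = {
    3: (3, 1, 6),
    5: (6, 1, 17),
    8: (8, 6, 26),
    16: (18, 8, 74),
}


def pfa_complex_op_counts(factors):
    """Return (real multiplies, real adds) for a PFA on complex data."""
    N = 1
    for f in factors:
        N *= f
    # N / N_i short DFTs of length N_i are needed along dimension i.
    mults = sum((N // f) * (SHORT_DFT[f][0] - SHORT_DFT[f][1]) for f in factors)
    adds = sum((N // f) * SHORT_DFT[f][2] for f in factors)
    return 2 * mults, 2 * adds   # double the real-data counts for complex data


print(pfa_complex_op_counts([8, 3, 5]))   # (460, 2076)
print(pfa_complex_op_counts([16, 3, 5]))  # (1100, 4812)
```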

The  scaling  problem  exists  for  the  PFA  just  as  it  does  for  the  FFT.  The  adds 
and  multiplies  within  each  of  the  layers  of  DFTs  increase  the  range  of  the  data.  In 
order  to  determine  how  much  the  range  expands,  more  must  be  known  about  the 
small  DFT  algorithms.  As  explained  above,  there  is  first  a  layer  of  additions,  called  a 
preweave,  that  usually  leads  to  a  small  data  expansion.  This  is  why  the  additions  in 
the  column  DFTs  of  Figure  4.2  are  represented  by  a  trapezoidal  block.  Next  there  is 
a  single  layer  of  multiplies,  in  which  each  of  the  pieces  of  data  out  of  the  first  stage  is 
multiplied  by  a  coefficient  that  is  a  combination  of  twiddle  factors.  This  is  followed 
by  another  set  of  additions,  called  the  postweave,  that  yields  the  original  number  of 
data points. Normally, the data grows by a factor equal to the transform length in a
DFT. An $N_i$-point transform should therefore add $\log_2 N_i$ bits onto the range of the
data. If the coefficients in the multiply stage are scaled up to integers by multiplying
them by $2^{b_c}$, then the data will grow by $\log_2 N_i + b_c$ bits in the ith stage of the PFA.
Again,  the  alternatives  available  for  reducing  the  number  of  scaling  operations  are  to 
either  start  with  a  larger  dynamic  range  or  to  use  coefficients  with  fewer  bits,  which 
decreases  accuracy. 


4.3  Winograd  Fourier  Transform  Algorithm 

The Winograd Fourier Transform Algorithm (WFTA), not to be confused with Winograd's
algorithms for computing short DFTs, goes one step beyond the PFA by nesting
the three layers of operations required within each of the short DFTs. The same input
and output mappings used for the PFA are also used in the WFTA. Taylor describes
how the preweave, multiply, and postweave stages of the small DFTs can be viewed as
matrix operations on a vector of points. Letting x' and X' be vectors of the reordered
input and output points, respectively, a small-$N_i$ DFT can be written as

$$X' = S_{N_i} C_{N_i} T_{N_i} x', \tag{4.12}$$

where $T_{N_i}$ and $S_{N_i}$ are incidence matrices (i.e., they contain small integers or fractions,
only ±1 and 0 for very short DFT lengths), and $C_{N_i}$ contains elements only along
its diagonal. The WFTA rearranges these operators so that the data points first
pass through all the preweave layers of additions, then through all the multiplication
stages, and finally through the postweave stages [17]:

$$X' = (S_{N_J} \cdots S_{N_2} S_{N_1})(C_{N_J} \cdots C_{N_2} C_{N_1})(T_{N_J} \cdots T_{N_2} T_{N_1}) x'. \tag{4.13}$$

A diagram of this procedure for the 15-point DFT is shown in Figure 4.3. Since the
multiplication stages are each point-by-point multiplications of the data by coefficients
derived from twiddle factors, the separate layers of multiplications may be combined
into one set of multiplications by coefficients that can be precalculated. This reduction
to one layer of multiplications is the primary advantage of the WFTA.

Figure 4.3: A 15-point WFTA [3], p. 71.

The operation count for the WFTA is slightly more complicated than for the FFT
or the PFA, and it depends on the order of the stages. Let the length N be broken
down into the J factors $N_1, \ldots, N_J$, which are the lengths of the small DFTs. We
will also assume that this represents the order of the different stages. Let $A_i$ and
$M_i$ represent the number of adds and multiplies, respectively, in the ith stage. $M_i$ is
also the number of intermediate points, due to data expansion, in one of the $N_i$-point
DFTs. The total number of adds is given by

$$A = \sum_{i=1}^{J} \left( \prod_{k=1}^{i-1} M_k \right) \left( \prod_{k=i+1}^{J} N_k \right) A_i. \tag{4.14}$$

The number of multiplies is simply the product of the multiply count for each of the
short DFTs:

$$M = \prod_{i=1}^{J} M_i. \tag{4.15}$$

Since the multiplies in the WFTA are nested together, multiplications by one in the
short DFTs must be counted since the corresponding factors in the following stages
are generally not one.
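
The nested counts can be evaluated with the Python sketch below. It assumes that stage i operates on data already expanded to $M_k$ points by the earlier stages (k < i) and still at $N_k$ points for the later stages (k > i), with the factors taken in the order listed in Table 4.2, and that the real-data counts of Table 4.1 are doubled for complex data. Under those assumptions the results match the WFTA columns of Table 4.2. The names are hypothetical.

```python
# N_i: (M_i, A_i) = (total multiplies including by one, adds), from Table 4.1.
SHORT_DFT = {
    2: (2, 2), 3: (3, 6), 4: (4, 8), 5: (6, 17),
    7: (9, 36), 8: (8, 26), 9: (11, 43), 16: (18, 74),
}


def wfta_complex_op_counts(factors):
    """Return (real multiplies, real adds) for a WFTA on complex data."""
    mults = 1
    for f in factors:
        mults *= SHORT_DFT[f][0]          # one nested layer of multiplies
    adds = 0
    for i, f in enumerate(factors):
        expanded = 1                      # product of M_k over earlier stages
        for g in factors[:i]:
            expanded *= SHORT_DFT[g][0]
        remaining = 1                     # product of N_k over later stages
        for g in factors[i + 1:]:
            remaining *= g
        adds += expanded * remaining * SHORT_DFT[f][1]
    return 2 * mults, 2 * adds            # double for complex data


print(wfta_complex_op_counts([9, 7]))        # (198, 1394)
print(wfta_complex_op_counts([8, 9, 7, 5]))  # (9504, 99068)
```

Reordering the factors leaves the multiply count unchanged but alters the add count, which is the dependence on stage order noted in the text.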

Scaling  is  not  so  big  a  problem  with  the  WFTA  as  it  is  in  the  FFT  and  PFA 
because  there  is  only  one  layer  of  multiplications  by  coefficients  scaled  up  to  integer 
values.  Using  the  same  notation  as  above,  the  total  dynamic  range  required  for  the 
WFTA  (in  bits)  is 

$$b = b_d + b_c + \log_2 N. \tag{4.16}$$


The  ideal  point  at  which  to  scale  the  data  is  after  the  multiplication  stage  or  after 
the  entire  transform. 


4.4  Comparison  of  Algorithms 

Table 4.2 shows the number of operations required by the three algorithms for varying
DFT lengths. Implementations of the three DFT algorithms in the RNS have been
compared by Taylor. Because scaling is the primary weakness of the RNS, Taylor
bases his comparison of the three algorithms on the ratio of transform operations
to scaling operations. He shows that because the WFTA has the smallest ratio of
scaling operations to transform operations, the RNS will have the greatest comparative
advantage over a two's complement system for this algorithm. In fact, if scaling
were to be done after the entire WFTA and if no more processing were required after
the transform, the scaling could be effectively eliminated by incorporating it into the
RNS-to-two's complement conversion algorithm [17]. The advantage of the WFTA is
apparent from the previous parts of this chapter and from Table 4.2 in which it is
shown that the WFTA minimizes the number of multiplies. Although the RNS is
generally very efficient at multiplication, these multiplications are by small constants
which may each have been scaled up by approximately 6-8 bits, depending on the desired
accuracy. The main disadvantages of the WFTA are that it requires a complex
reordering of the data at the input and output and that it is not in-place because of
the data expansion in the multiplication stage. Also, because the coefficients in the
multiplication stage are determined by nesting and combining the multiplication layers
in the small DFTs, a new routine must be developed for every transform length,
unlike the FFT. These are the primary reasons why implementations of this algorithm
are uncommon even in conventional two's complement arithmetic. However, since the
objective here is to determine the relative advantage of using RNS, the WFTA is an
ideal focus for comparison because it provides the greatest possible chance for the
RNS to outperform two's complement arithmetic. Also, since the primary reason for
using RNS would be to increase performance despite the difficulty of using unconventional
arithmetic, the WFTA may be a more likely choice because of its minimal
multiplication count.

Table 4.2: Comparison of operation counts for DFTs of complex data.

                       FFT                  PFA                  WFTA
N      Factors      Multiplies  Adds     Multiplies  Adds     Multiplies  Adds
63     9·7          -           -        284         1236     198         1394
64     2^6          768         1152     -           -        -           -
120    8·3·5        -           -        460         2076     288         2076
126    2·9·7        -           -        568         2724     396         3040
128    2^7          1792        2688     -           -        -           -
240    16·3·5       -           -        1100        4812     648         5136
252    4·9·7        -           -        1136        5952     792         6584
256    2^8          4096        6144     -           -        -           -
504    8·9·7        -           -        2524        13164    1584        14428
512    2^9          9216        13824    -           -        -           -
1008   16·9·7       -           -        5804        29100    3564        34416
1024   2^10         20480       30720    -           -        -           -
2048   2^11         45056       67584    -           -        -           -
2520   8·9·7·5      -           -        17660       82956    9504        99068

Chapter 5

Performance  Comparison 


The  performance  of  RNS  and  two’s  complement  systems  for  computing  DFTs  will 
now  be  compared  using  the  area-time  metric.  Chapter  3  developed  expressions  for  the 
area-time  products  of  the  basic  components  in  RNS  and  two's  complement  integer 
arithmetic.  Chapter  4  discussed  the  possible  DFT  algorithms  and  the  Winograd 
Fourier  Transform  Algorithm  was  found  to  be  the  most  appropriate  for  RNS.  In 
addition,  operation  counts  were  found  for  the  different  size  WFTAs.  The  number 
of  adds  and  multiplies  required  for  the  transform,  plus  the  additional  operations 
required  for  conversion  and  scaling,  are  now  combined  with  the  area-time  measures 
for  the  individual  components  to  determine  such  a  measure  for  the  two  systems. 


5.1  System  Area-Time  Products 

The  area-time  products  for  the  systems  being  compared  will  take  into  account  the 
addition  and  multiplication  operations  only.  Area-time  products  for  the  adds  and 
multiplies  can  first  be  developed  separately: 

$$(AT)_{add} = (AT)_{adder} \times \#\text{adds} \tag{5.1}$$

$$(AT)_{mult} = (AT)_{multiplier} \times \#\text{multiplies}. \tag{5.2}$$

These approximations are relatively straightforward. It would take a single adder
a time equal to the number of adds times the time per add to do all the adds.
Alternatively, assuming the adds could be ordered properly, k adders could perform
the same number of adds in 1/k times the above time, but with k times the size.

The other part of the problem is to combine $(AT)_{add}$ and $(AT)_{mult}$ into a single
figure. We will add these two quantities, which reflects the area-time for the
most efficient system, in which the hardware is perfectly allocated between additions
and multiplications. Since the problem is not finished until both the additions and
multiplications are done, we are interested in finding

$$AT = \min\left( (A_{add} + A_{mult}) \max(T_{add}, T_{mult}) \right). \tag{5.3}$$

Assuming that $A_{op}T_{op}$ is constant for both addition and multiplication, the total
area-time, AT, will be minimized when

$$T_{add} = T_{mult}. \tag{5.4}$$

If one of these times were smaller than the other, the corresponding area could be
decreased without affecting the system's time, thereby reducing AT further. The
desire to make the adds and multiplies take the same amount of time so that they
may be done together may seem contradictory, since the WFTA, as explained in
Chapter 4, ends with a set of output additions. However, when there is a constant
stream of data and DFTs to perform, making these two sets of operations take equal
times minimizes the average time for an efficient program and controller. When
$T_{add} = T_{mult}$, Equation 5.3 reduces to

$$AT = (AT)_{add} + (AT)_{mult}. \tag{5.5}$$

5.2  Analysis  of  Problems 

The example DFT problems were analyzed by breaking the problem into several parts.
First, the size of the DFT, number of bits, and RNS system are defined. The number
of bits in the data and coefficients, $b_d$ and $b_c$ respectively, can be chosen depending
on the resolution desired. The required number of bits for an N-point WFTA, given
by Equation 4.16, is

$$b = b_d + b_c + \log_2 N. \tag{5.6}$$

The next step is to determine the number and size of the moduli for an RNS system.
To show how the performance changes with the number of moduli, we will let l vary,
usually from 3 to 8 moduli. A simple approximation for the size of the required
moduli is b/l bits.

With  the  system  defined,  the  next  step  is  to  determine  the  area-time  products  for 
the  modular  adders  and  multipliers.  The  general  formula  is 

^component  =  0  '  size  component)  •  (time  for  component).  (5.7) 

The same is also done for two's complement components, using the original $b$-bit
range. The number of operations to be performed by the components will depend on
the size of the DFT and on the number system. For two's complement arithmetic,
the numbers of operations are given in Table 4.2 and can be represented by $Add_{DFT}$ and
$Mult_{DFT}$. For RNS, there is also the requirement to convert and scale the data. For
complex data, conversion to RNS requires one ROM lookup per point. Following the
DFT, the data may be scaled or converted to two's complement, but usually not both.
(If both are needed, the scaling may be done after conversion, at which point it is essentially a "free"
operation.) Either of these processes takes on the order of $l$ adds and multiplies in
each RNS channel per point. A mixed-radix conversion actually requires $l - 1$ of each
operation, but requires $l - 1$ more after the coefficients are found. A safe assumption
is that there are at least $l$ operations, the number required for scaling. Since we
have complex data, there are $2l$ adds and the same number of multiplies. The total
numbers of RNS operations are given below:

$Add_{RNS} = Add_{DFT} + 2l \cdot \#\,\text{points}$ (5.8)

$Mult_{RNS} = Mult_{DFT} + 2l \cdot \#\,\text{points}$. (5.9)

Figure 5.1: Area-time products for systems performing WFTAs.

The  area-time  products  for  the  two  systems  can  now  be  found  by  multiplying  the 
area- time  for  each  component  by  the  number  of  corresponding  operations  and  adding 
the  area-time  products  for  the  two  types  of  operations. 
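The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's Matlab script: the function names are assumptions, and the operation counts for the 504-point WFTA (14428 additions, 1584 multiplications) are the ones used in the appendix scripts.

```python
def rns_operation_counts(add_dft, mult_dft, num_moduli, n_points):
    """Equations 5.8 and 5.9: scaling or conversion adds 2*l adds and
    2*l multiplies per (complex) point to the DFT's own operation counts."""
    overhead = 2 * num_moduli * n_points
    return add_dft + overhead, mult_dft + overhead

def total_area_time(at_add, n_add, at_mult, n_mult):
    """System A*T: each component's area-time product weighted by the
    number of operations of that type, then summed."""
    return at_add * n_add + at_mult * n_mult

# 504-point WFTA with 5 moduli.
adds, mults = rns_operation_counts(14428, 1584, num_moduli=5, n_points=504)
print(adds, mults)  # 19468 6624
```

For this case the scaling overhead more than quadruples the number of multiplications, which is why the conversion and scaling cost matters so much in the comparison.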

Results comparing RNS and integer systems are shown in Figure 5.1. These results
were generated with Matlab; the script file is included in the appendix. The lines
show the area-time products versus transform length for different numbers of moduli
and for a two's complement integer system. The numbers of bits in the data, $b_d$, and
in the coefficients, $b_c$, are both assumed to be 8. The plot suggests that RNS can provide
a significant performance advantage if at least 4 moduli are used. However, these
results were generated with the ideal assumption that all of the moduli had $b/l$ bits.

The results of Figure 5.1 may be made more realistic by changing the way the
moduli are selected. In Chapter 3, it was noted that $b/l$ may not be an integer and
that not all of the moduli can be of the same size. The moduli selection may be made
[...] the number of moduli. The Matlab script file is again included in the appendix. Note
that the RNS with at least 4 moduli can still outperform conventional arithmetic, but
the performance advantage has been reduced. For the 504-point DFT with 7 moduli,
the advantage has been reduced from a factor of 4 to a factor of 3.
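One way to pick integer modulus widths that together cover the required range is the greedy allocation used in the appendix script atcomp2.m: the first modulus gets ceil(b/l) bits, and the bits still needed are then split as evenly as possible (rounding up) among the remaining moduli. The Python sketch below mirrors that loop; the function name is an assumption made for illustration.

```python
def allocate_modulus_bits(total_bits, num_moduli):
    """Split an integer bit budget across moduli, largest widths first."""
    sizes = []
    remaining = total_bits
    for k in range(num_moduli):
        moduli_left = num_moduli - k
        bits = -(-remaining // moduli_left)  # integer ceiling division
        sizes.append(bits)
        remaining -= bits
    return sizes

# 25 bits (the 504-point case, rounded up) over 7 moduli.
print(allocate_modulus_bits(25, 7))  # [4, 4, 4, 4, 3, 3, 3]
```

Because the widest modulus sets the channel delay, any rounding up from the ideal $b/l$ increases the time term of the area-time product.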

We will now further restrict the selection of moduli by recognizing that not only
must the moduli be different, but they may not all be of the form $2^{b_i}$. The first
modulus is usually chosen to be of this type, and the second to be of the form $2^{b_i} - 1$,
but other moduli cannot be even or contain any other factors in common with the
previous moduli. This means that there is some penalty in terms of dynamic range loss
from the ideal $b_i$ bits, as discussed in Chapter 3. This penalty is much more significant
for large numbers of moduli. The actual penalty depends on the desired range and
exact number of moduli, but an average provides a good approximation. Studying a
few examples for 20-30 bit ranges has shown that the average penalty is about 0.2
bits for 5-moduli systems, 0.25 bits for 6 moduli, and 0.43 bits for 7 moduli. RNS
moduli can be chosen to provide an adequate dynamic range for different problems,
taking into account this loss; the results for this approach are shown in Figure 5.3.
The maximum performance advantage has now been reduced to about a factor of 2.7.
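The effect of the penalty on the bit budget can be sketched as follows. This Python fragment follows the appendix script atcomp3.m, which multiplies the average penalty by the number of moduli before rounding up, and which applies no penalty for 3 or 4 moduli. The function and constant names are illustrative assumptions.

```python
import math

# Average penalties (bits) quoted in the text; 3 and 4 moduli are treated
# as penalty-free, as in the appendix script atcomp3.m.
PENALTY_BITS = {3: 0.0, 4: 0.0, 5: 0.2, 6: 0.25, 7: 0.43}

def rns_bits_with_penalty(n_points, num_moduli, bd=8, bc=8):
    """Total RNS bits needed once the dynamic range loss is included."""
    ideal = bd + bc + math.log2(n_points)  # Equation 5.6
    return math.ceil(ideal + num_moduli * PENALTY_BITS[num_moduli])

for l in (3, 4, 5, 6, 7):
    print(l, rns_bits_with_penalty(504, l))
```

For the 504-point transform this gives 25 bits for 3 or 4 moduli but 28 bits for 7 moduli, so the extra channels pay for part of their own overhead.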

A  final  plot  is  now  included  which  compares  the  results  when  the  two’s  complement 
system  is  made  more  realistic  by  recognizing  that  it  need  not  have  the  same  range  as 
RNS  because  of  the  ability  to  scale  quickly  whenever  needed.  The  required  number 
of bits for a minimal amount of scaling is approximately the original number of bits,
$b_d$, plus the maximum of $b_c$ and $\log_2 N$. This allows for the completion of at least the
preweave stage before scaling, but also provides enough dynamic range to complete
the multiplication. These results are shown in Figure 5.4. Under this scheme, the
integer system has approximately 0.87 times the area-time of the best RNS system for
a 504-point DFT.
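The reduced word length and the resulting two's complement area-time product can be reproduced with a short Python sketch. The adder model ($b^2$) and multiplier model ($(b^2 - 1.5b)(2b + 1)$) are taken from the appendix script atcomp4.m, as are the 252-point operation counts (6584 additions, 792 multiplications); the function names are assumptions made for illustration.

```python
import math

def reduced_range_bits(n_points, bd=8, bc=8):
    """Word length that covers either the preweave growth (log2 N) or the
    coefficient multiply (bc) before scaling is needed."""
    return math.ceil(bd + max(bc, math.log2(n_points)))

def integer_area_time(b, n_add, n_mult):
    """Two's complement A*T in full-adder units."""
    at_add = b * b
    at_mult = (b * b - 1.5 * b) * (2 * b + 1)
    return at_add * n_add + at_mult * n_mult

b = reduced_range_bits(252)
print(b, integer_area_time(b, 6584, 792))  # 16 7749056.0
```

The 252-point case needs 16 bits instead of the 24 implied by Equation 5.6, and the result, 7,749,056, is the two's complement figure quoted in Section 6.2.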

Figure 5.3: Comparison of two's complement and RNS systems using dynamic range
penalty in moduli selection.

Figure 5.4: Comparison of two's complement and RNS systems using smaller dynamic
range two's complement.

The analyses in this chapter have compared RNS to two's complement systems by
calculating an area-time product based on the area-time products of the basic
computational components and the numbers of the different operations. The computed
values may not be obtainable in practice because there has been no accounting for
required registers, controllers, and interconnect. However, as previously stated, our
goal is to determine whether RNS has an advantage based purely on its efficiency
in performing arithmetic. Non-computational components have not been addressed
because these parts would be nearly identical in the two types of systems. While
RNS is competitive with two's complement arithmetic in a strict "apples-to-apples"
comparison of efficiency in performing a WFTA, RNS is not the clear winner when
realistic assumptions are made about how such a problem would really be solved in
two's complement. The most important difference is that a two's complement system
would not have to begin with a dynamic range large enough to solve the problem
without scaling in order to successfully compute the DFT.

Chapter 6


Conclusions 

6.1  Performance  of  RNS 

The  results  of  Chapter  5  show  that  RNS  is  not  a  clear  winner  over  two's  complement 
for  the  problem  of  performing  WFTAs.  RNS  initially  appeared  to  have  a  significant 
advantage,  largely  due  to  the  small  size  and  delay  of  the  ROMs  used  in  the  quarter 
squares  multiplier.  In  Chapter  3  it  was  shown  how  favorably  relatively  small  ROMs 
compared  to  full  adders  in  terms  of  size  and  speed.  However,  this  advantage  is 
outweighed  by  the  disadvantage  of  not  having  an  efficient  method  of  scaling  and  by 
the  necessity  of  converting  data  back  to  two's  complement  through  a  lengthy  series 
of  operations. 

These  results  for  the  WFTA  indicate  that  RNS  does  not  provide  an  advantage 
over  two’s  complement  for  the  general  problem  of  computing  a  DFT.  In  Chapter  4  it 
was  shown  that  among  the  most  efficient  DFT  algorithms  known,  the  WFTA  provides 
RNS  with  the  greatest  chance  of  outperforming  two’s  complement.  This  is  because 
there  is  only  one  layer  of  multiplies  along  the  data  path,  thus  minimizing  the  growth 
in  dynamic  range  due  to  multiplication  by  constants  scaled  up  to  the  desired  number 
of bits of resolution. Since RNS does not appear to be more efficient than two's
complement for WFTAs, it will not be better for other algorithms, such as the PFA
and FFT.

Figure 6.1: General block diagram of system using distributed arithmetic.


6.2  New  Ideas 

The size and speed of relatively small ROMs suggest that they may be used to replace
computational  components  working  on  conventional  two’s  complement  arithmetic,  if 
they  can  be  used  in  situations  where  a  small  addressing  space  is  possible.  A  possible 
application  is  in  distributed  arithmetic,  in  which  data  to  be  processed  is  fed  into 
components  in  a  bit-serial  fashion.  The  components  process  the  data  and  continuously 
shift  and  add  in  results  for  successive  bits  such  that  there  is  also  a  constant  stream 
of  bits  coming  out.  These  bits  can  be  immediately  used  by  the  next  component,  even 
though  the  previous  component  is  still  working  on  the  more  significant  bits  of  the 
same  piece  of  data.  A  general  example  of  this  technique  is  shown  in  Figure  6.1. 

Distributed arithmetic may be used with ROMs to implement the series of operations
needed to compute a radix-2 or radix-4 butterfly used in a Cooley-Tukey FFT.
A diagram of a radix-4 butterfly implemented in distributed arithmetic is shown in
Figure 6.2.

Figure 6.2: Radix-4 butterfly using distributed arithmetic.

Here, ROMs are used to multiply three of the four outputs by precomputed
twiddle factors. Since the data comes in one bit at a time, the ROMs need
only two address lines: one for the real part and one for the imaginary part.
Assuming that the twiddle factors are represented by 8-bit two's complement numbers,
the ROMs will output 9-bit numbers representing either the real or imaginary part of
the product of the twiddle factor with the real and imaginary parts of the data
represented by the two current bits. The ROMs produce 9-bit products because each part
of the complex product requires a sum of two 8-bit products. Since each of the two
layers of adds before the complex multiplication adds another bit to the result, there
is a total growth of 11 bits in the range of the data passing through the component.
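The ROM-based complex multiplication can be modeled in software as a sketch of the idea rather than of the hardware. In the Python fragment below, each 4-entry table is addressed by one bit of the real part and one bit of the imaginary part, and the table outputs are accumulated with the weight of the current bit; the most significant (sign) bit of a two's complement input carries a negative weight. Function names and the LSB-first ordering are assumptions for illustration; real hardware would shift the accumulator instead of multiplying by a weight.

```python
def make_roms(cr, ci):
    """Build the two 4-entry ROMs for a twiddle factor cr + j*ci.
    Address = (real_bit << 1) | imag_bit."""
    rom_re = [cr * a - ci * b for a in (0, 1) for b in (0, 1)]
    rom_im = [ci * a + cr * b for a in (0, 1) for b in (0, 1)]
    return rom_re, rom_im

def bit_serial_complex_multiply(xr, xi, cr, ci, nbits=8):
    """Multiply (xr + j*xi) by (cr + j*ci), consuming one bit of each
    input part per step, as in distributed arithmetic."""
    rom_re, rom_im = make_roms(cr, ci)
    acc_re = acc_im = 0
    for k in range(nbits):  # least significant bit first
        addr = (((xr >> k) & 1) << 1) | ((xi >> k) & 1)
        # The sign bit of a two's complement number has weight -2^(nbits-1).
        weight = -(1 << k) if k == nbits - 1 else (1 << k)
        acc_re += weight * rom_re[addr]
        acc_im += weight * rom_im[addr]
    return acc_re, acc_im

print(bit_serial_complex_multiply(3, 4, 5, -2))  # (23, 14)
```

Each ROM entry is a sum or difference of two 8-bit twiddle components, which is why a 9-bit output word is needed.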

The area and time of an FFT implemented by connecting cells of the form shown in
Figure 6.2 into a large array can now be computed. We will calculate these measures
for a 256-point FFT, a convenient size since $4^4 = 256$. The butterfly can be broken
down into

• 16 full adders/registers for the 8 complex adds in the butterfly,

• $2(b_c + 1)$ registers to delay the first complex output, multiplied by 1,

• 6 ROM/accumulate systems for the results of the three complex multiplications,
each with the following:

  - one $2^2 \times (b_c + 1)$-bit ROM

  - a $(b_c + 1)$-bit full adder

  - $2(b_c + 1)$ registers.

The area of the butterfly is

$(16 + 6(b_c + 1))$ full adders $+ (16 + 14(b_c + 1))$ registers $+ 6$ $(2^2 \times (b_c + 1))$-bit ROMs.

We will assume that a 1-bit register is about the same size as a full adder and that
$b_c = 8$, as before. From the formulas developed earlier, the size of a $(2^2 \times 9)$-bit ROM
is about 1.12 full adders. The above expression reduces to an area of 219 full adders.
A 256-point FFT using radix-4 butterflies will have $\log_4 256 = 4$ layers, each with
$256/4 = 64$ butterflies. The total area is

$(64 \cdot 4) \times 219 = 56{,}064$ full adders.
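The area figures above can be checked with a short Python sketch. The ROM area model is transcribed from the appendix function aromm (the variables c and r follow that function); the helper names and the explicit MATLAB-style rounding function are assumptions added for illustration.

```python
import math

def matlab_round(x):
    """Round half away from zero, as MATLAB's round does."""
    return int(math.floor(abs(x) + 0.5)) * (1 if x >= 0 else -1)

def rom_area_full_adders(addr_bits, out_bits):
    """ROM layout area relative to one full adder (model from aromm)."""
    half_log = 0.5 * math.log2(out_bits)
    c = matlab_round(0.5 * addr_bits - half_log - 1e-4)
    r = matlab_round(0.5 * addr_bits + half_log + 1e-4)
    width = (16 * r + 27) + (9 * out_bits * 2 ** c + 2)
    height = (8 * 2 ** r + 27) + 18 * c
    return width * height / 7676.0  # one full adder = 7676 square lambda

def butterfly_area(bc=8):
    """Radix-4 butterfly area in full adders (register = one full adder)."""
    adders = 16 + 6 * (bc + 1)
    registers = 16 + 14 * (bc + 1)
    roms = 6 * rom_area_full_adders(2, bc + 1)
    return adders + registers + roms

area = round(butterfly_area())
print(area, 64 * 4 * area)  # 219 56064
```

The six small ROMs contribute fewer than 7 of the 219 full-adder equivalents; the registers dominate the butterfly area.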

The  time  estimate  begins  with  an  estimate  of  the  clocking  period.  The  access  time 
for the given ROMs is approximately 0.45 times the delay of a full adder. The worst
case  delay  is  one  ROM  look-up,  a  one-bit  add,  and  storage  in  a  register.  This  total 
delay  can  be  estimated  to  be  the  delay  of  2.5  full  adders,  assuming  the  register  is  the 
same  speed  as  a  full  adder.  It  was  found  above  that  there  is  a  growth  of  11  bits  in 
the  range  of  numbers  passing  through  each  butterfly.  Since  there  are  4  layers,  the 
total  growth  is  44  bits.  Assuming  the  original  data  has  8  bits,  it  will  take  a  total  of 
52  clock  cycles  for  the  last  bit  of  data  to  come  out.  Bits  9-52  of  the  input  will  be 
sign  extended.  The  total  time  to  perform  the  FFT  is  now 

$52 \cdot 2.5 = 130$ full adder delays.

The area-time product is

$56{,}064 \cdot 130 = 7{,}288{,}320$ $(AT)_{FA}$.
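The timing estimate can be reproduced in the same way. In the Python sketch below, the ROM access-time model is transcribed from the appendix function for ROM delay (the two-argument form called as tromm in the scripts); the 2.5 full-adder clock period and the total area of 56,064 full adders are the values used in the text. Function names are illustrative assumptions.

```python
import math

def matlab_round(x):
    """Round half away from zero, as MATLAB's round does."""
    return int(math.floor(abs(x) + 0.5)) * (1 if x >= 0 else -1)

def rom_delay_full_adders(addr_bits, out_bits):
    """ROM access time relative to one full-adder delay (1.168 ns)."""
    half_log = 0.5 * math.log2(out_bits)
    c = matlab_round(0.5 * addr_bits - half_log - 1e-4)
    r = matlab_round(0.5 * addr_bits + half_log + 1e-4)
    t_row = 5.77e-4 * 4 ** r + 0.0389 * r
    t_rom = 5.77e-4 * out_bits ** 2 * 4 ** c + 0.0389 * 2 ** r
    t_col = (1 + 2 * c) * c * 0.0493
    return (t_row + t_rom + t_col) / 1.168

def fft_time(data_bits=8, growth_per_layer=11, layers=4, clock_period=2.5):
    """Bit-serial latency: one clock per output bit, in full-adder delays."""
    cycles = data_bits + growth_per_layer * layers
    return cycles, cycles * clock_period

print(round(rom_delay_full_adders(2, 9), 2))  # 0.45
cycles, time = fft_time()
print(cycles, time, 56064 * time)  # 52 130.0 7288320.0
```

The ROM lookup plus a one-bit add and a register store comes to about 2.45 full-adder delays, which the text rounds to 2.5.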




In Chapter 5, it was found that the best area-time product for an RNS system
doing a comparable WFTA (252 points) is 10,011,000. The best case is when 5 moduli
are used. When the two's complement system is modified to allow a smaller dynamic
range in exchange for scaling, it achieves an area-time product of 7,749,056 for the
252-point WFTA. Neither of these two measures includes the cost of the registers
required to connect or pipeline the computational components, yet the distributed
arithmetic system, which takes registers into account, outperforms both. In addition,
the distributed arithmetic system is implementing the DFT via the FFT, instead of
the WFTA. The measures above account for arithmetic computations only, so the
systems using the WFTA have an inherent advantage. The FFT implementation
described above is nevertheless computationally more efficient, and it does not require the
complicated reordering of the input and output points that the WFTA does. Also, if
we allow rounding in intermediate stages, then the time per problem is much less
because it will not be necessary to wait for 52 bits of output.


6.3  Suggestions  for  Further  Research 

A promising area for further research now is the use of distributed arithmetic for
computing DFTs. The quick analysis above shows that this approach appears to
have major advantages over conventional arithmetic, especially when ROMs are
programmed to implement operations such as complex multiplication, which normally
requires a series of operations on just a few sets of inputs. With more work, this
technique may be refined to further reduce the size and time required by the system.

The  RNS  does  not  show  any  clear  advantages  over  two’s  complement  for  the 
problem  of  computing  DFTs.  The  major  disadvantage  is  the  growth  in  dynamic  range 
when  DFTs  are  calculated.  This  growth  is  worse  for  RNS  because  the  twiddle  factors 
of  the  DFTs  must  be  scaled  up  to  integer  values  but  cannot  be  easily  scaled  back  down. 
RNS would be more suitable for problems that deal mainly with integers and that do
not pose dynamic range problems. One application that has been considered is in the
design  of  an  address  generator  for  a  system  using  the  PFA  or  the  WFTA.  The  input 
and  output  reordering  equations  for  these  algorithms  were  presented  in  Chapter  4  and 
were  found  to  be  based  fundamentally  on  modular  arithmetic.  The  different  modular 
components  could  compute  the  parts  of  multi-dimensional  addresses  independently 
for  a  memory  oriented  towards  this  kind  of  addressing.  This  application  would  also 
deal exclusively with integers and would not pose problems of dynamic range.

Appendix A

Matlab M-files

function a=arom(no)
ni=[ no' (no+ones(no))' 2*no']';
no=[1;1;1]*no;
c=round(0.5*ni-0.5*(log(no)/log(2))-.0001);
r=round(0.5*ni+0.5*(log(no)/log(2))+.0001);
%ROM layout area in lambda units
w1=16*r+27.*ones(no);
w2=9.*no.*exp(c*log(2))+2.*ones(no);
w=w1+w2;
h1=exp(r*log(2))*8+27;
h2=18*c;
h=h1+h2;
a=w.*h/7676;  %scale to area of full adder

function a=aromm(ni,no)
c=round(0.5*ni-0.5*(log(no)/log(2))-.0001);
r=round(0.5*ni+0.5*(log(no)/log(2))+.0001);
%ROM layout area in lambda units
w1=16*r+27;
w2=9.*no.*exp(c*log(2))+2;
w=w1+w2;
h1=exp(r*log(2))*8+27;
h2=18*c;
h=h1+h2;
a=w.*h/7676;  %scale to area of full adder

function t=trom(no)
ni=[no' (no+ones(no))' 2*no']';
no=[1;1;1]*no;
c=round(0.5*ni-0.5*(log(no)/log(2))-.0001);
r=round(0.5*ni+0.5*(log(no)/log(2))+.0001);
%ROM time in nanoseconds
trow=5.77e-4*exp(r*log(4))+.0389*r;
trom=5.77e-4*(no.*no).*exp(c*log(4))+.0389*exp(r*log(2));
tcol=(ones(no)+2*c).*c*0.0493;
t=(trow+trom+tcol)/1.168;  %scale to time full adder

function t=tromm(ni,no)
c=round(0.5*ni-0.5*(log(no)/log(2))-.0001);
r=round(0.5*ni+0.5*(log(no)/log(2))+.0001);
%ROM time in nanoseconds
trow=5.77e-4*exp(r*log(4))+.0389*r;
trom=5.77e-4*(no.*no).*exp(c*log(4))+.0389*exp(r*log(2));
tcol=(1+2*c).*c*0.0493;
t=(trow+trom+tcol)/1.168;  %scale to time full adder

% layout.m
% computes area, time, and product for ROMs of various sizes
no=2:18;  % # of bits (different cases)
xax=no;
ni=[ no' (no+ones(no))' (2*no)']';
cases=[1;1;1];
no=cases*no;
nls=ones(no);
a=7676;  % area of full adder in lambda**2
t=1.168;  % delay through full adder in nsec
c=round(0.5*ni-0.5*(log(no)/log(2))-.0001);
r=round(0.5*ni+0.5*(log(no)/log(2))+.0001);
%ROM layout area in lambda units
w1=16*r+27.*ones(no);
w2=9.*no.*exp(c*log(2))+2.*ones(no);
w=w1+w2;
h1=exp(r*log(2))*8+27;
h2=18*c;
h=h1+h2;
arom=w.*h;  %scale to area of full adder
%ROM time in nanoseconds
trow=5.77e-4*exp(r*log(4))+.0389*r;
trom=5.77e-4*(no.*no).*exp(c*log(4))+.0389*exp(r*log(2));
tcol=(nls+2*c).*c*0.0493;
trom=(trow+trom+tcol);  %scale to time full adder
atrom=arom.*trom;
axis([2,18,3,13]);
semilogy(xax,arom);
xlabel('Bits in ROM');
ylabel('Size (square lambdas)');
grid;
!del arom.met
meta arom;
pause;
axis([2,18,-1,9]);
semilogy(xax,trom);
xlabel('Bits in ROM');
ylabel('Speed (ns)');
grid;
!del trom.met
meta trom;
pause;
axis([2,18,3,21]);
semilogy(xax,atrom);
xlabel('Bits in ROM');
ylabel('Area-Time (ns*square lambdas)');
grid;
pause;
arom=arom/a;
trom=trom/t;
atrom=atrom/(a*t);
axis([2,18,-1,9]);
semilogy(xax,arom);
xlabel('Bits in ROM');
ylabel('Size (# FAs)');
grid;
!del aromr.met
meta aromr;
pause;
axis([2,18,-1,9]);
semilogy(xax,trom);
xlabel('Bits in ROM');
ylabel('Speed (# FAs)');
grid;
!del tromr.met
meta tromr;
pause;
axis([2,18,-1,15]);
semilogy(xax,atrom);
xlabel('Bits in ROM');
ylabel('Area-Time (# FAs)');
grid;
!del atromr.met
meta atromr;

% addcomp.m
% comparison of RNS adders
no=2:6;
a=arom(no);
t=trom(no);
arom=a(3,:);
aromcorr=no+a(2,:);
adualadd=2*no;
aplot=[arom' aromcorr' adualadd'];
semilogy(no,aplot);
xlabel('Number of bits');
ylabel('Area relative to full adder');
labels=['ROM Lookup    ';
        'Correction ROM';
        'Dual Adders   '];
text(4.15*ones(aplot(1,:)),aplot(4,:)+[0 0 -6],labels);
pause;
trom=t(3,:);
tromcorr=no+t(2,:);
tdualadd=no+ones(no);
tplot=[trom' tromcorr' tdualadd'];
semilogy(no,tplot);
xlabel('Number of bits');
ylabel('Time relative to full adder');
labels=['ROM Lookup    ';
        'Correction ROM';
        'Dual Adders   '];
text(4.65*ones(aplot(1,:)),tplot(5,:)+[0 0 -3],labels);
pause;
atplot=aplot.*tplot;
semilogy(no,atplot);
xlabel('Number of bits');
ylabel('Area-Time relative to full adder');
labels=['ROM Lookup    ';
        'Correction ROM';
        'Dual Adders   '];
text(4.65*ones(aplot(1,:)),atplot(5,:)+[-8000 0 -50],labels);
grid;
!del addcomp.met
meta addcomp;


% multcomp.m
% comparison of RNS multipliers
n=2:8;
a=arom(n);
t=trom(n);
arom=a(3,:);
aind=3*a(1,:)+2*n;
aqsm=6*n+2*a(1,:);
afa=3*(n-ones(n)).*n;
aqsm2=4*n+2*a(2,:);
aplot=[arom' aind' aqsm' afa' aqsm2'];
semilogy(n,aplot);
xlabel('Number of bits');
ylabel('Area relative to full adder');
labels=['ROM Lookup      ';
        'Index Multiplier';
        'Quarter Sq Mult ';
        'Full Adder Array';
        'Modified QSM    '];
text(6.1*ones(aplot(1,:)),aplot(7,:)+[0 0 0 0 0],labels);
grid; pause;
trom=t(3,:);
tind=n+ones(n)+2*t(1,:);
tqsm=2*(n+ones(n))+t(1,:);
tfa=n.*n+n-ones(n);
tqsm2=2*n+ones(n)+t(2,:);
tplot=[trom' tind' tqsm' tfa' tqsm2'];
semilogy(n,tplot);
xlabel('Number of bits');
ylabel('Time relative to full adder');
labels=['ROM Lookup      ';
        'Index Multiplier';
        'Quarter Sq Mult ';
        'Full Adder Array';
        'Modified QSM    '];
text(6.1*ones(tplot(1,:)),tplot(7,:)+[0 0 0 0 0],labels);
grid; pause;
atplot=aplot.*tplot;
semilogy(n,atplot);
xlabel('Number of bits');
ylabel('Area-Time relative to full adder');
labels=['ROM Lookup      ';
        'Index Multiplier';
        'Quarter Sq Mult ';
        'Full Adder Array';
        'Modified QSM    '];
text(6.1*ones(atplot(1,:)),[1e6 1e2 800 1e4 2e3],labels);
grid;
!del multcomp.met
meta multcomp

% rnsmcomp.m
% comparison of RNS multipliers to integer multipliers
% for different numbers of moduli
l=4:7;
b=5:35;
l=l';
L=l*ones(b);
ratio=((ones(l)./l)*b);
for i=1:(size(l)*[1;0])
  a=arom(ratio(i,:));
  rnsa(i,:)=(4*ratio(i,:)+2*a(2,:));
  t=trom(ratio(i,:));
  rnst(i,:)=2*ratio(i,:)+ones(b)+t(2,:);
end;
rnsat=L.*rnsa.*rnst;
%rnsat=L.*(6*ratio+2*0.122*exp(ratio*log(2.03)));
%rnsat=rnsat.*(2*(ratio+ones(ratio))+0.0413*exp(ratio*log(1.93)));
B=ones(l)*b;
intat=(B.*B-1.5*B).*B*2;
atplot=rnsat./intat;
plot(b,atplot)
%semilogy(b,atplot);
grid;
ylabel('Area-time ratio, RNS:Integer');
xlabel('Number of bits');
labels=['4';
        '5';
        '6';
        '7'];
text(31*ones(atplot(:,1))',[.25 .08 .04 .01],labels);
!del rnsmcomp.met
meta rnsmcomp

% atcomp.m
% area-time comparison for DFT computation
% ideal selection of moduli
dft=[63 252 504];
dftls=ones(dft);
add=[1394 6584 14428];  % # of additions in DFTs
mult=[198 792 1584];  % # of multiplications in DFTs
l=[3 4 5 6 7]';  % # of moduli (different cases)
lls=ones(l);
bigls=lls*dftls;
bd=8;  % # of bits in data
bc=8;  % # of bits in coefficients
a=24;  % # of trans in full adder
t=8;  % RC delay through full adder
b=(bd+bc)*dftls+(log(dft)./log(2));
%
for i=1:(size(dft)*[0;1])
  d=dft(i);
  adds=add(i);
  mults=mult(i);
  for j=1:(size(l)*[1;0])
    ll=l(j);
    % find bb
    bb=b(i)/ll;
    %find at for dft and ll
    area=ll*(2*bb);
    time=(bb+1);
    rnsaddat(j,i)=area*time;
    area=ll*(4*bb+2*aromm((bb+1),bb));
    trnsm=2*bb+1+tromm((bb+1),bb);
    rnsmultat(j,i)=area*trnsm;
  end
end
%
intaddat=b.*b;
%
%rnsaddat=2b(b/l+1)*a*t;
%
intmultat=(b.*b-1.5*b).*(2*b+dftls);
%+(6/24)*(b.*b).*(2*b+dftls);
%
scadd=2*(l)*dft;
scmult=2*l*dft;
conadd=0*(l*dft);
%+3*2*(l-lls)*dft;
conmult=0*(l*dft);
%
intat=intaddat.*add+intmultat.*mult;
for i=1:(size(l)*[1;0])
  for j=1:(size(dft)*[0;1])
    x(i,j)=aromm(bd,b(j))*tromm(bd,b(j))*dft(j)*2;
  end
end
y=rnsaddat.*((lls*add)+scadd+conadd);
z=rnsmultat.*((lls*mult)+scmult+conmult);
rnsat=x+y+z;
comp=[intat' rnsat'];
plot(dft,comp);
%title('Minimum Area-Time Products for WFTAs');
xlabel('WFTA Size');
ylabel('Minimum A*T (#FAs)');
labels=['Int';
        '3  ';
        '4  ';
        '5  ';
        '6  ';
        '7  '];
text(504*ones(comp(1,:)),comp(3,:)+[0 0 0 0 0 0],labels);
!del atcomp.met
meta atcomp

% atcomp2.m
% area-time comparison for DFT computation
% stricter control over moduli (integers)
dft=[63 252 504];
dftls=ones(dft);
add=[1394 6584 14428];  % # of additions in DFTs
mult=[198 792 1584];  % # of multiplications in DFTs
l=[3 4 5 6 7]';  % # of moduli (different cases)
lls=ones(l);
bigls=lls*dftls;
bd=8;  % # of bits in data
bc=8;  % # of bits in coefficients
a=24;  % # of trans in full adder
t=8;  % RC delay through full adder
b=ceil((bd+bc)*dftls+(log(dft)./log(2)));
%
for i=1:3
  d=dft(i);
  adds=add(i);
  mults=mult(i);
  for j=1:(size(l)*[1;0])
    ll=l(j);
    % find bb(k)
    bb(1)=ceil(b(i)/ll);
    bleft=b(i)-bb(1);
    for k=2:l(j)
      bb(k)=ceil(bleft/(ll-k+1));
      bleft=bleft-bb(k);
    end
    %find at for d and ll
    area=0;
    for k=1:l(j)
      area=area+2*bb(k);
    end
    time=(bb(1)+1);
    rnsaddat(j,i)=area*time;
    area=0;
    for k=1:l(j)
      szrnsm=4*bb(k)+2*aromm((bb(k)+1),bb(k));
      area=area+szrnsm;
    end
    trnsm=2*bb(1)+1+tromm((bb(1)+1),bb(1));
    rnsmultat(j,i)=area*trnsm;
  end
end
%
intaddat=b.*b;
%
%rnsaddat=2b(b/l+1)*a*t;
%
intmultat=(b.*b-1.5*b).*(2*b+dftls);
%+(6/24)*(b.*b).*(2*b+dftls);
%
scadd=2*(l)*dft;
scmult=2*l*dft;
conadd=0*(l*dft);
%+3*2*(l-lls)*dft;
conmult=0*(l*dft);
%
intat=intaddat.*add+intmultat.*mult;
for i=1:(size(l)*[1;0])
  for j=1:(size(dft)*[0;1])
    x(i,j)=aromm(bd,b(j))*tromm(bd,b(j))*dft(j)*2;
  end
end
y=rnsaddat.*((lls*add)+scadd+conadd);
z=rnsmultat.*((lls*mult)+scmult+conmult);
rnsat=x+y+z;
comp=[intat' rnsat'];
plot(dft,comp);
%title('Minimum Area-Time Products for WFTAs');
xlabel('WFTA Size');
ylabel('Minimum A*T (#FAs)');
labels=['Int';
        '3  ';
        '4  ';
        '5  ';
        '6  ';
        '7  '];
text(504*ones(comp(1,:)),comp(3,:)+[0 0 0 0 0 0],labels);
!del atcomp2.met
meta atcomp2

% atcomp3.m
% area-time comparison for DFT computation
% more strict control of moduli ('fudge' loss in dyn range)
dft=[63 252 504];
% 1008];
dftls=ones(dft);
add=[1394 6584 14428];
% 34416];  % # of additions in DFTs
mult=[198 792 1584];
% 3564];  % # of multiplications in DFTs
l=[3 4 5 6 7]';  % # of moduli (different cases)
lls=ones(l);
bigls=lls*dftls;
bd=8;  % # of bits in data
bc=8;  % # of bits in coefficients
a=24;  % # of trans in full adder
t=8;  % RC delay through full adder
b=(bd+bc)*dftls+(log(dft)./log(2));
rnsb=ceil(lls*b+(l.*[0 0 0.2 0.25 0.43]')*dftls);
b=ceil(b);
%
for i=1:(size(dft)*[0 1]')
  d=dft(i);
  adds=add(i);
  mults=mult(i);
  for j=1:(size(l)*[1 0]')
    ll=l(j);
    % find bb(k)
    bb(1)=ceil(rnsb(j,i)/ll);
    %bb(1)=ceil(2*bb(1)-log(exp(bb(1)*log(2))-3)/log(2));
    bleft=rnsb(j,i)-bb(1);
    for k=2:l(j)
      bb(k)=(bleft/(ll-k+1));
      if bb(k)<2,
        bb(k)=2;
      else
        % z=log(exp(bb(k)*log(2))-3)/log(2);
        bb(k)=ceil(bb(k));
      end
      bleft=bleft-bb(k);
    end
    %find at for d and ll
    area=0;
    for k=1:l(j)
      area=area+2*bb(k);
    end
    time=(bb(1)+1);
    rnsaddat(j,i)=area*time;
    area=0;
    for k=1:l(j)
      szrnsm=4*bb(k)+2*aromm((bb(k)+1),bb(k));
      area=area+szrnsm;
    end
    trnsm=2*bb(1)+1+tromm((bb(1)+1),bb(1));
    rnsmultat(j,i)=area*trnsm;
  end
end
%
intaddat=b.*b;
%
%rnsaddat=2b(b/l+1)*a*t;
%
intmultat=(b.*b-1.5*b).*(2*b+dftls);
%+(6/24)*(b.*b).*(2*b+dftls);
%
scadd=2*l*dft;
scmult=2*l*dft;
conadd=0*(l*dft);
%+3*2*(l-lls)*dft;
conmult=0*(l*dft);
%
intat=intaddat.*add+intmultat.*mult;
for i=1:(size(l)*[1;0])
  for j=1:(size(dft)*[0;1])
    x(i,j)=aromm(bd,rnsb(i,j))*tromm(bd,rnsb(i,j))*dft(j)*2;
  end
end
y=rnsaddat.*((lls*add)+scadd+conadd);
z=rnsmultat.*((lls*mult)+scmult+conmult);
rnsat=x+y+z;
comp=[intat' rnsat'];
%
% Distributed
N=[64 256 512];
logn=[6 8 9];
a=aromm(4,(4*(bc+1)));
t=tromm(4,(4*(bc+1)));
size=(N/2).*(log(N)/log(2))*(a+(bc+1)*2);
time=t*(bd*ones(N)+(bc+1)*(log(N)/log(2)));
dist=size.*time;
%
plot(dft,comp,N,dist);
%title('Minimum Area-Time Products for WFTAs');
xlabel('WFTA Size');
ylabel('Minimum A*T (#FAs)');
labels=['Int';
        '3  ';
        '4  ';
        '5  ';
        '6  ';
        '7  '];
text(504*ones(comp(1,:)),comp(3,:)+[0 0 0 0 0 0],labels);
!del atcomp3.met
meta atcomp3

% atcomp4.m
% area-time comparison for DFT computation
% more strict control of moduli (fewer integers)
% smaller range for integers
dft=[63 252 504];
% 1008];
dftls=ones(dft);
add=[1394 6584 14428];
% 34416];  % # of additions in DFTs
mult=[198 792 1584];
% 3564];  % # of multiplications in DFTs
l=[3 4 5 6 7]';  % # of moduli (different cases)
lls=ones(l);
bigls=lls*dftls;
bd=8;  % # of bits in data
bc=8;  % # of bits in coefficients
a=24;  % # of trans in full adder
t=8;  % RC delay through full adder
b=(bd+bc)*dftls+(log(dft)./log(2));
rnsb=ceil(lls*b+(l.*[0 0 0.2 0.25 0.43]')*dftls);
%
for i=1:(size(dft)*[0 1]')
  d=dft(i);
  adds=add(i);
  mults=mult(i);
  for j=1:(size(l)*[1 0]')
    ll=l(j);
    % find bb(k)
    bb(1)=ceil(rnsb(j,i)/ll);
    %bb(1)=ceil(2*bb(1)-log(exp(bb(1)*log(2))-3)/log(2));
    bleft=rnsb(j,i)-bb(1);
    for k=2:l(j)
      bb(k)=(bleft/(ll-k+1));
      if bb(k)<2,
        bb(k)=2;
      else
        % z=log(exp(bb(k)*log(2))-3)/log(2);
        bb(k)=ceil(bb(k));
      end
      bleft=bleft-bb(k);
    end
    %find at for d and ll
    area=0;
    for k=1:l(j)
      area=area+2*bb(k);
    end
    time=(bb(1)+1);
    rnsaddat(j,i)=area*time;
    area=0;
    for k=1:l(j)
      szrnsm=4*bb(k)+2*aromm((bb(k)+1),bb(k));
      area=area+szrnsm;
    end
    trnsm=2*bb(1)+1+tromm((bb(1)+1),bb(1));
    rnsmultat(j,i)=area*trnsm;
  end
end
b=ceil(bd*dftls+max((log(dft)/log(2)),(bc*dftls)));
%
intaddat=b.*b;
%
%rnsaddat=2b(b/l+1)*a*t;
%
intmultat=(b.*b-1.5*b).*(2*b+dftls);
%+(6/24)*(b.*b).*(2*b+dftls);
%
scadd=2*(l)*dft;
scmult=2*l*dft;
conadd=0*(l*dft);
%+3*2*(l-lls)*dft;
conmult=0*(l*dft);
%
intat=intaddat.*add+intmultat.*mult;
for i=1:(size(l)*[1;0])
  for j=1:(size(dft)*[0;1])
    x(i,j)=aromm(bd,rnsb(i,j))*tromm(bd,rnsb(i,j))*dft(j)*2;
  end
end
y=rnsaddat.*((lls*add)+scadd+conadd);
z=rnsmultat.*((lls*mult)+scmult+conmult);
rnsat=x+y+z;
comp=[intat' rnsat'];
plot(dft,comp);
%title('Minimum Area-Time Products for WFTAs');
xlabel('WFTA Size');
ylabel('Minimum A*T (#FAs)');
labels=['Int';
        '3  ';
        '4  ';
        '5  ';
        '6  ';
        '7  '];
text(504*ones(comp(1,:)),comp(3,:)+[0 0 0 0 0 0],labels);
!del atcomp4.met
meta atcomp4